feat: pool-membership lock + cancellation hardening + smart_health refresh + tunables (1.0.0-13 -> 1.0.0-21)

Substantial feature + reliability sweep. Each version below was developed,
tested live against the maple/TrueNAS deployment, and Codex-reviewed
before bundling.

1.0.0-13 — asyncssh proc.kill() doesn't actually kill the remote process
  (sshd ignores SSH signal-channel requests by default), so a cancel of a
  long-running badblocks left the remote process running and proc.wait()
  hanging — pinning the asyncio.Semaphore slot forever.

  * Wrap long-lived commands in `sh -c 'echo PID:$$; exec <cmd>'` to
    capture the remote PID; store in burnin._remote_pids[job_id].
  * burnin._kill_remote_process(job_id) opens a fresh SSH session and
    issues `kill -9 <pid>` — sshd honours that.
  * Bound proc.wait() with asyncio.wait_for(timeout=15).
  * burnin._active_tasks tracks every _run_job task so cancel_job and
    check_stuck_jobs can actually cancel the asyncio task (was DB-only
    before). Also fixes the documented asyncio.create_task GC gotcha
    (the event loop keeps only weak references to tasks).
  * _run_job finalizer reads current state and skips the write if state
    != 'running' so cancelled/unknown aren't clobbered.
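A minimal sketch of the PID-capture wrapper described above, under stated assumptions: `wrap_with_pid` and `parse_pid_line` are hypothetical helper names for illustration and are not taken from burnin.py. Because `exec` replaces the wrapping shell, the PID printed by `echo PID:$$` is also the PID of the long-running command.

```python
import shlex
from typing import Optional


def wrap_with_pid(cmd: str) -> str:
    """Wrap cmd so the remote shell prints its own PID, then execs cmd.

    Hypothetical helper: `exec` replaces the shell process, so the PID
    printed on the first stdout line is the PID of cmd itself.
    """
    inner = f"echo PID:$$; exec {cmd}"
    return f"sh -c {shlex.quote(inner)}"


def parse_pid_line(line: str) -> Optional[int]:
    """Return the PID from a 'PID:<n>' line, or None for any other line."""
    if not line.startswith("PID:"):
        return None
    try:
        return int(line[4:].strip())
    except ValueError:
        return None


wrapped = wrap_with_pid("badblocks -wsv /dev/sdX")
```

The captured PID can then be passed to `kill -9 <pid>` over a separate SSH session, as the bullets above describe.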

1.0.0-14 — poller._upsert_drive ON CONFLICT only refreshed temperature/
  health/poll timestamps; devname/serial/model/size_bytes were stuck at
  first-INSERT values forever. After kernel SCSI re-enumeration two
  drives could both show as `sda`. Fixed by updating all six fields.
  Also added 7-day stale filter to _DRIVES_QUERY so removed drives drop
  off the dashboard while audit/burnin_jobs FKs stay intact.
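The upsert fix can be illustrated with a small in-memory SQLite sketch. The table layout and the conflict key (`truenas_disk_id`) are assumptions made for this example; the real schema in database.py may differ. The point is that every identity column is refreshed from `excluded` on conflict, so a re-enumerated device name is picked up on the next poll.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE drives (
           truenas_disk_id TEXT PRIMARY KEY,  -- hypothetical stable key
           devname TEXT, serial TEXT, model TEXT, size_bytes INTEGER,
           temperature_c INTEGER
       )"""
)

# Refresh identity columns as well as the polled values on conflict.
UPSERT = """
INSERT INTO drives (truenas_disk_id, devname, serial, model, size_bytes, temperature_c)
VALUES (?, ?, ?, ?, ?, ?)
ON CONFLICT(truenas_disk_id) DO UPDATE SET
    devname       = excluded.devname,
    serial        = excluded.serial,
    model         = excluded.model,
    size_bytes    = excluded.size_bytes,
    temperature_c = excluded.temperature_c
"""

conn.execute(UPSERT, ("disk-1", "sda", "S1", "ModelX", 1000, 30))
# Same physical disk after SCSI re-enumeration: now visible as sdb.
conn.execute(UPSERT, ("disk-1", "sdb", "S1", "ModelX", 1000, 31))

row_count = conn.execute("SELECT COUNT(*) FROM drives").fetchone()[0]
devname = conn.execute("SELECT devname FROM drives").fetchone()[0]
```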

1.0.0-15/-16 — pool-membership lock.
  * ssh_client.get_pool_membership() runs `zpool list -vHP` and parses
    the flattened TrueNAS output (container vdevs + their device children
    both appear at depth 1; section markers cache/log/spare/special/dedup
    switch the role).
  * ssh_client.get_zfs_member_drives() runs `lsblk -no NAME,FSTYPE -l`
    to detect drives carrying ZFS labels not in any active pool — they
    get pool_name='(exported)', pool_role='exported'.
  * Three idempotent ALTER TABLE migrations on drives:
    pool_name/pool_role/pool_seen_at.
  * burnin.start_job raises PoolMemberError if pool_name IS NOT NULL and
    the drive isn't in burnin._unlock_grants. Routes layer maps to 409
    with structured detail {pool_name, pool_role, pool_locked: true} so
    the frontend can render an unlock affordance.
  * POST /api/v1/drives/{id}/unlock accepts {confirm_token, operator,
    reason}. Token is the pool name for active pools, "DESTROY BOOT POOL"
    for boot-pool, "DESTROY EXPORTED POOL" for exported. Reason >= 5
    chars. TTL = UNLOCK_TTL_SECONDS = 600. Audit event types:
    pool_drive_unlocked / boot_pool_drive_unlocked /
    exported_pool_drive_unlocked.
  * Grants are in-memory only — container restart wipes them.
  * UI: lock icon (yellow/red/orange), pool pill, conditional Unlock vs
    Burn-In button. modal_unlock.html with type-to-confirm field.
    Live unlock countdown via tickUnlockCountdowns() in app.js.
  * Daily report: red banner listing every unlock event from the last
    24h, with operator + reason + timestamp.
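As a rough illustration of the 409 mapping mentioned above, the sketch below turns a `PoolMemberError` into a status code plus structured body. `to_http_error` and the outer `"detail"` key are assumptions for this example (the routes layer is not shown here); only the three detail fields come from the description above.

```python
from typing import Optional, Tuple


class PoolMemberError(Exception):
    """Simplified stand-in for burnin.PoolMemberError."""

    def __init__(self, drive_id: int, pool_name: str, pool_role: Optional[str]):
        super().__init__(f"Drive {drive_id} is part of pool '{pool_name}'")
        self.drive_id = drive_id
        self.pool_name = pool_name
        self.pool_role = pool_role


def to_http_error(exc: PoolMemberError) -> Tuple[int, dict]:
    """Hypothetical helper: (status_code, JSON body) for a locked drive."""
    return 409, {
        "detail": {
            "pool_name": exc.pool_name,
            "pool_role": exc.pool_role,
            "pool_locked": True,  # lets the frontend offer the unlock flow
        }
    }


status, body = to_http_error(PoolMemberError(3, "tank", "data"))
```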

1.0.0-17 — Codex review fail-open + XSS + structured-error fixes.
  * ssh_client.get_pool_membership / get_zfs_member_drives now return
    None on failure (vs {} for 'definitely empty'). poller passes
    update_pool=False to _upsert_drive on detection failure, preserving
    existing pool columns instead of clearing them. Without this fix a
    1-second SSH blip silently unlocked every drive.
  * mailer._build_unlock_banner_html escapes every interpolated field
    via html.escape() (was '<' only). Time filter switched to
    julianday() — string >= against datetime('now', '-1 day') compared
    formats with different separators ('T' vs ' ') and timezone
    suffixes, causing subtle off-by-N-hour inclusion.
  * app.js submitStart/submitBatchStart now detect the structured
    pool_locked 409 detail and auto-open the unlock modal for the
    offending drive (was [object Object] in toast).
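The timestamp-comparison bug can be reproduced in a few lines. This sketch uses fixed literal timestamps rather than `datetime('now', '-1 day')` so the result is deterministic; the values are made up for illustration. An ISO 8601 string with a `T` separator sorts after any same-date string with a space separator, so a plain string comparison wrongly includes an event that is eight hours older than the cutoff, while `julianday()` compares the actual instants.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

stored = "2026-05-01T01:00:00+00:00"  # ISO 8601, as datetime.isoformat() writes it
cutoff = "2026-05-01 09:00:00"        # the format SQLite's datetime() returns

string_cmp, julian_cmp = conn.execute(
    "SELECT ? >= ?, julianday(?) >= julianday(?)",
    (stored, cutoff, stored, cutoff),
).fetchone()

# string_cmp is 1: 'T' (0x54) sorts after ' ' (0x20), so the older event is included.
# julian_cmp is 0: 01:00 UTC really is before 09:00 UTC.
```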

1.0.0-18 — Codex grant-binding + commit-ordering fixes.
  * Unlock grants bound to the (pool_name, pool_role) observed at unlock
    time. _UnlockGrant dataclass; _is_unlocked and unlock_expiry
    invalidate the grant if the live row's pool identity has changed.
    Prevents an 'exported' unlock from carrying over when the drive
    turns out to be in active 'tank' or 'boot-pool'.
  * grant_pool_unlock now writes to _unlock_grants only AFTER db.commit()
    succeeds — previously a failed audit insert left an unaudited grant
    armed.

1.0.0-19 — Codex race + cancellation classification + test scaffold.
  * Partial unique index uniq_active_burnin_per_drive ON burnin_jobs
    (drive_id) WHERE state IN ('queued','running'). INSERT now wraps in
    try/except aiosqlite.IntegrityError -> ValueError so the read-then-
    insert race in start_job can't produce two queued rows for the same
    drive.
  * _run_job tracks was_cancelled flag; on bare task.cancel() (shutdown,
    future code paths) where DB state is still 'running', finalizer
    writes 'unknown' instead of mis-classifying as 'failed'.
  * tests/ stdlib unittest scaffold:
    - test_pool_parser.py (21 tests): mirror/raidz/draid container vdevs,
      single-disk depth-1, plural section markers, partition stripping,
      sdaa-style names, multi-pool, role reset between pools.
    - test_unlock_flow.py (18 tests): token validation per pool kind,
      identity-binding invalidation, TTL expiry, audit-commit-then-arm
      ordering, unique-active-burnin partial index.
    Run via `python -m unittest discover tests/`. No new dependencies.
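The race-stopper can be sketched with the standard-library `sqlite3` module. The table here is a reduced stand-in for burnin_jobs and `try_queue` is a hypothetical helper; the index definition follows the one named above. A second active row for the same drive is rejected by the index, while a new job is allowed once the earlier one reaches a terminal state.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE burnin_jobs (id INTEGER PRIMARY KEY, drive_id INTEGER, state TEXT)"
)
conn.execute(
    "CREATE UNIQUE INDEX uniq_active_burnin_per_drive "
    "ON burnin_jobs (drive_id) WHERE state IN ('queued','running')"
)


def try_queue(drive_id: int) -> bool:
    """Hypothetical helper: True if a queued row was inserted."""
    try:
        conn.execute(
            "INSERT INTO burnin_jobs (drive_id, state) VALUES (?, 'queued')",
            (drive_id,),
        )
        return True
    except sqlite3.IntegrityError:
        # The partial unique index rejected a second active job.
        return False


first = try_queue(7)
second = try_queue(7)
conn.execute("UPDATE burnin_jobs SET state='passed' WHERE drive_id=7")
third = try_queue(7)
```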

1.0.0-20 — Spearfoot-inspired badblocks tunables.
  * surface_validate_block_size (-b, default 4096), surface_validate_
    block_buffer (-c, default 64), surface_validate_passes (-p, default
    1) exposed in Settings UI; persist via settings_store.json.
    Validation: block size must be a power of 2 between 512 and
    1048576. Defaults preserve existing behaviour. Bumping to 8192/64/1
    roughly halves runtime on multi-TB HDDs at ~2x RAM cost.
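The block-size rule above reduces to a range check plus a power-of-two test. The sketch below is one way to express it; `is_valid_block_size` and `io_buffer_bytes` are hypothetical names, and the buffer arithmetic is only a first-order estimate (block size times blocks per IO), since badblocks may allocate more than one such buffer.

```python
def is_valid_block_size(n: int) -> bool:
    """True if n is a power of two between 512 and 1048576 inclusive."""
    return 512 <= n <= 1048576 and (n & (n - 1)) == 0


def io_buffer_bytes(block_size: int, block_buffer: int) -> int:
    """Bytes covered by one IO batch: -b value times -c value."""
    return block_size * block_buffer


default_buffer = io_buffer_bytes(4096, 64)
bumped_buffer = io_buffer_bytes(8192, 64)
```

With the defaults this gives 262144 bytes per batch, and doubling the block size to 8192 doubles it to 524288, which is consistent with the "~2x RAM" note above.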

1.0.0-21 — SMART overall-health column actually populated.
  * /api/v2.0/disk doesn't expose smart_health, so every drive defaulted
    to UNKNOWN forever (only burn-in stages ever wrote a real value).
  * ssh_client.get_smart_health_map([devnames]) runs `smartctl -H` for
    all drives in a single SSH session, deterministically delimited with
    @@devname@@ ... @@END@@ markers. Returns {devname: PASSED|FAILED|
    UNKNOWN} or None on SSH failure.
  * poller calls it every 5th cycle (~1 min at default 12s interval),
    caches in _state['smart_health_cache'] so transient failures preserve
    the previous values.
  * Dashboard CSS: col-smart min-width 150 -> 95, horizontal padding 14
    -> 6 so Short/Long SMART columns fit comfortably on a 13-inch
    display.
  * 5 additional parser tests (44 total, all passing).
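A sketch of how the `@@devname@@ ... @@END@@` framing might be parsed. This is not the parser in ssh_client.py: `parse_smart_health` is a hypothetical name, and the matching rules (treating "test result: PASSED" or "Health Status: OK" as PASSED, "test result: FAILED" as FAILED, anything else as UNKNOWN) are assumptions based on typical `smartctl -H` output.

```python
import re
from typing import Dict

# One framed section per drive: "@@sda@@\n<smartctl output>@@END@@"
_SECTION = re.compile(r"@@(?P<dev>[^@\s]+)@@\n(?P<body>.*?)@@END@@", re.DOTALL)


def parse_smart_health(output: str) -> Dict[str, str]:
    """Map each framed devname to PASSED, FAILED, or UNKNOWN."""
    result: Dict[str, str] = {}
    for match in _SECTION.finditer(output):
        body = match.group("body")
        if re.search(r"test result:\s*PASSED|Health Status:\s*OK", body):
            status = "PASSED"
        elif re.search(r"test result:\s*FAILED", body):
            status = "FAILED"
        else:
            status = "UNKNOWN"
        result[match.group("dev")] = status
    return result


sample = (
    "@@sda@@\nSMART overall-health self-assessment test result: PASSED\n@@END@@\n"
    "@@sdb@@\nSMART overall-health self-assessment test result: FAILED!\n@@END@@\n"
    "@@sdc@@\nSmartctl open device: /dev/sdc failed\n@@END@@\n"
)
health = parse_smart_health(sample)
```

Framing each drive's output explicitly means one drive's error text cannot be attributed to its neighbour, which is the reason for the deterministic markers noted above.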

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Brandon Walter 2026-05-02 09:25:56 -04:00
parent b85bac7686
commit 5da1a1704f
18 changed files with 2623 additions and 311 deletions


@@ -19,6 +19,7 @@ Cancellation:
 import asyncio
 import logging
 import time
+from contextlib import asynccontextmanager
 from datetime import datetime, timezone
 import aiosqlite
@@ -66,14 +67,29 @@ POLL_INTERVAL = 5.0 # seconds between progress checks during active stages
 _semaphore: asyncio.Semaphore | None = None
 _client: TrueNASClient | None = None
+# Live job tracking — keeps a strong reference to every _run_job task so it
+# isn't garbage-collected (asyncio.create_task only keeps a weak ref) and so
+# cancel_job / check_stuck_jobs can actually unwedge a stuck task.
+_active_tasks: dict[int, "asyncio.Task"] = {}
+# Remote PID of any long-running SSH child process (currently only badblocks)
+# so we can kill it via a fresh SSH session — proc.kill() over asyncssh sends
+# a "signal" channel request that OpenSSH sshd ignores by default, leaving
+# the remote process running and proc.wait() hanging forever.
+_remote_pids: dict[int, int] = {}
 def _now() -> str:
     return datetime.now(timezone.utc).isoformat()
-def _db():
-    """Open a fresh WAL-mode connection. Caller must use 'async with'."""
-    return aiosqlite.connect(settings.db_path)
+@asynccontextmanager
+async def _db():
+    """Open a WAL-mode connection with busy_timeout so writers wait for the lock
+    instead of immediately raising 'database is locked' under contention."""
+    async with aiosqlite.connect(settings.db_path) as db:
+        await db.execute("PRAGMA busy_timeout=10000")
+        yield db
 # ---------------------------------------------------------------------------
@@ -104,11 +120,228 @@ async def init(client: TrueNASClient) -> None:
         await db.commit()
     for job_id in queued:
-        asyncio.create_task(_run_job(job_id))
+        _spawn_run_job(job_id)
     log.info("Burn-in orchestrator ready (max_concurrent=%d)", settings.max_parallel_burnins)
def _spawn_run_job(job_id: int) -> "asyncio.Task":
"""Schedule a _run_job task and keep a strong reference to it.
Plain asyncio.create_task() only leaves a weak reference behind, so the
task can be GC'd before it ever runs. Storing it in _active_tasks also
lets cancel_job / check_stuck_jobs cancel it directly.
"""
task = asyncio.create_task(_run_job(job_id))
_active_tasks[job_id] = task
def _cleanup(t: "asyncio.Task") -> None:
# Remove only if it's still us — avoid clobbering a re-enqueued task.
if _active_tasks.get(job_id) is t:
_active_tasks.pop(job_id, None)
_remote_pids.pop(job_id, None)
task.add_done_callback(_cleanup)
return task
async def _kill_remote_process(job_id: int) -> None:
"""Send kill -9 to the remote PID associated with this job, if any.
asyncssh's proc.kill() sends an SSH 'signal' channel request which
OpenSSH's sshd does not honor by default. Opening a fresh session and
running /bin/kill is the reliable way to actually terminate the process.
"""
pid = _remote_pids.pop(job_id, None)
if not pid:
return
try:
from app import ssh_client
async with await ssh_client._connect() as conn:
await asyncio.wait_for(
conn.run(f"kill -9 {pid} 2>/dev/null || true", check=False),
timeout=10,
)
log.info("Remote-killed PID %d for job %d", pid, job_id)
except Exception as exc:
log.warning("Failed to remote-kill PID %d for job %d: %s", pid, job_id, exc)
# ---------------------------------------------------------------------------
# Pool-drive unlock state
# ---------------------------------------------------------------------------
#
# Drives that ZFS reports as belonging to an active zpool (including the
# boot pool) are locked from burn-in until the operator explicitly unlocks
# them via POST /api/v1/drives/{id}/unlock. Grants live in memory only —
# a container restart wipes them, which is the right default for "this is
# very dangerous." TTL is bounded so an unlock you forgot about can't sit
# armed indefinitely.
import time as _time
from dataclasses import dataclass
UNLOCK_TTL_SECONDS = 600 # 10 minutes
BOOT_POOL_NAME = "boot-pool"
BOOT_POOL_CONFIRM_TOKEN = "DESTROY BOOT POOL"
EXPORTED_POOL_ROLE = "exported"
EXPORTED_CONFIRM_TOKEN = "DESTROY EXPORTED POOL"
@dataclass
class _UnlockGrant:
"""An operator-issued, time-bounded permission to burn-in a pool drive.
The grant is BOUND to the (pool_name, pool_role) observed at unlock
time. If a subsequent poll reclassifies the drive (e.g. it was
"(exported)" when unlocked but is now in active pool "tank", or it
used to be a cache vdev and now shows as data), the grant is
invalidated. Otherwise the operator's "I confirm this exported drive
is decommissioned" judgement would silently authorise destruction
of a live pool.
"""
expiry: float
pool_name: str
pool_role: str | None
_unlock_grants: dict[int, _UnlockGrant] = {}
class PoolMemberError(Exception):
"""Raised by start_job when a drive is in a zpool and not unlocked."""
def __init__(self, drive_id: int, pool_name: str, pool_role: str | None):
self.drive_id = drive_id
self.pool_name = pool_name
self.pool_role = pool_role
is_boot = pool_name == BOOT_POOL_NAME
super().__init__(
f"Drive is part of {'BOOT POOL' if is_boot else 'pool'} "
f"'{pool_name}'{' (' + pool_role + ')' if pool_role else ''}. "
f"Unlock required before burn-in."
)
def _is_unlocked(drive_id: int, current_pool_name: str | None,
current_pool_role: str | None) -> bool:
"""True iff a non-expired grant exists AND the drive's pool identity
matches what was observed at unlock time."""
grant = _unlock_grants.get(drive_id)
if grant is None:
return False
if _time.time() >= grant.expiry:
_unlock_grants.pop(drive_id, None)
return False
if grant.pool_name != current_pool_name or grant.pool_role != current_pool_role:
# Pool identity changed since unlock — drive may now belong to a
# different (or live) pool. Invalidate the grant; operator must
# re-unlock with eyes-open against the current state.
_unlock_grants.pop(drive_id, None)
log.warning(
"Invalidating unlock grant for drive_id=%d: pool changed from "
"(%s, %s) to (%s, %s)",
drive_id, grant.pool_name, grant.pool_role,
current_pool_name, current_pool_role,
)
return False
return True
def unlock_expiry(drive_id: int, current_pool_name: str | None,
current_pool_role: str | None) -> float | None:
"""Return the absolute expiry of an active grant, or None.
Same identity-binding semantics as _is_unlocked: a grant whose stored
pool identity no longer matches the current row is treated as expired
and reaped. This is what the dashboard reads to decide whether to show
the unlocked-Burn-In affordance vs the locked-Unlock affordance.
"""
grant = _unlock_grants.get(drive_id)
if grant is None:
return None
if _time.time() >= grant.expiry:
_unlock_grants.pop(drive_id, None)
return None
if grant.pool_name != current_pool_name or grant.pool_role != current_pool_role:
_unlock_grants.pop(drive_id, None)
return None
return grant.expiry
async def grant_pool_unlock(drive_id: int, confirm_token: str,
operator: str, reason: str) -> float:
"""Validate confirmation token + reason and grant a time-limited unlock.
Raises ValueError on bad confirm_token, missing reason, or drive not
actually in a pool. Returns the unix expiry timestamp on success.
"""
if not reason or len(reason.strip()) < 5:
raise ValueError("A reason of at least 5 characters is required.")
if not operator or not operator.strip():
raise ValueError("Operator name is required.")
async with _db() as db:
db.row_factory = aiosqlite.Row
cur = await db.execute(
"SELECT pool_name, pool_role, devname FROM drives WHERE id=?",
(drive_id,),
)
row = await cur.fetchone()
if not row:
raise ValueError("Drive not found.")
pool_name = row["pool_name"]
pool_role = row["pool_role"]
if not pool_name:
raise ValueError(
"This drive is not part of any pool — no unlock needed."
)
# Boot-pool and exported pools both get dedicated, harder-to-fat-
# finger tokens. Active data pools just need their pool name typed.
if pool_name == BOOT_POOL_NAME:
expected = BOOT_POOL_CONFIRM_TOKEN
elif pool_role == EXPORTED_POOL_ROLE:
expected = EXPORTED_CONFIRM_TOKEN
else:
expected = pool_name
if (confirm_token or "").strip() != expected:
raise ValueError("Confirmation token does not match.")
if pool_name == BOOT_POOL_NAME:
evt = "boot_pool_drive_unlocked"
elif pool_role == EXPORTED_POOL_ROLE:
evt = "exported_pool_drive_unlocked"
else:
evt = "pool_drive_unlocked"
await db.execute(
"""INSERT INTO audit_events
(event_type, drive_id, burnin_job_id, operator, message)
VALUES (?,?,?,?,?)""",
(evt, drive_id, None, operator.strip(),
f"Unlocked {pool_name} drive {row['devname']} for burn-in: {reason.strip()}"),
)
await db.commit()
# Arm the in-memory grant ONLY after the audit row is durable. If the
# commit above raises, we exit without writing _unlock_grants — no
# unaudited active unlocks. The grant is bound to the (pool_name,
# pool_role) we observed under the open transaction so a later poll
# that reclassifies the drive invalidates it (see _is_unlocked).
expiry = _time.time() + UNLOCK_TTL_SECONDS
_unlock_grants[drive_id] = _UnlockGrant(
expiry=expiry,
pool_name=pool_name,
pool_role=pool_role,
)
log.warning(
"Pool-drive unlock granted: drive_id=%d pool=%s role=%s "
"operator=%s reason=%r",
drive_id, pool_name, pool_role, operator, reason,
)
return expiry
 # ---------------------------------------------------------------------------
 # Public API
 # ---------------------------------------------------------------------------
@@ -142,13 +375,35 @@ async def start_job(drive_id: int, profile: str, operator: str,
         if (await cur.fetchone())[0] > 0:
             raise ValueError("Drive already has an active burn-in job")
-        # Create job
-        cur = await db.execute(
-            """INSERT INTO burnin_jobs (drive_id, profile, state, percent, operator, created_at)
-               VALUES (?,?,?,?,?,?) RETURNING id""",
-            (drive_id, profile, "queued", 0, operator, now),
-        )
-        job_id = (await cur.fetchone())["id"]
+        # Pool-membership gate: locked unless the operator explicitly
+        # unlocked this drive via /api/v1/drives/{id}/unlock recently.
+        # _is_unlocked also checks that the grant's stored (pool_name,
+        # pool_role) still matches the live row — a grant issued for an
+        # exported drive doesn't carry over if the drive turns out to be
+        # in an active pool on the next poll.
+        cur = await db.execute(
+            "SELECT pool_name, pool_role FROM drives WHERE id=?", (drive_id,)
+        )
+        drow = await cur.fetchone()
+        if drow and drow["pool_name"] and not _is_unlocked(
+            drive_id, drow["pool_name"], drow["pool_role"]
+        ):
+            raise PoolMemberError(drive_id, drow["pool_name"], drow["pool_role"])
+        # Create job. The partial unique index uniq_active_burnin_per_drive
+        # (database.py) is the actual race-stopper here: if two concurrent
+        # /api/v1/burnin/start calls both pass the SELECT-COUNT check above,
+        # only one INSERT can win; the loser raises IntegrityError, which
+        # we surface with the same ValueError as the inline duplicate check.
+        try:
+            cur = await db.execute(
+                """INSERT INTO burnin_jobs (drive_id, profile, state, percent, operator, created_at)
+                   VALUES (?,?,?,?,?,?) RETURNING id""",
+                (drive_id, profile, "queued", 0, operator, now),
+            )
+            job_id = (await cur.fetchone())["id"]
+        except aiosqlite.IntegrityError:
+            raise ValueError("Drive already has an active burn-in job")
         # Create stage rows in the desired execution order
         for stage_name in stages:
@@ -164,7 +419,7 @@ async def start_job(drive_id: int, profile: str, operator: str,
             )
         await db.commit()
-    asyncio.create_task(_run_job(job_id))
+    _spawn_run_job(job_id)
     log.info("Burn-in job %d queued (drive_id=%d profile=%s operator=%s)",
              job_id, drive_id, profile, operator)
     return job_id
@@ -198,6 +453,13 @@ async def cancel_job(job_id: int, operator: str) -> bool:
         )
         await db.commit()
+    # Kill the remote child process FIRST (so proc.wait() in the running task
+    # can return), then cancel the task so any other awaits unblock.
+    await _kill_remote_process(job_id)
+    task = _active_tasks.get(job_id)
+    if task and not task.done():
+        task.cancel()
     log.info("Burn-in job %d cancelled by %s", job_id, operator)
     return True
@@ -206,10 +468,45 @@ async def cancel_job(job_id: int, operator: str) -> bool:
 # Job runner
 # ---------------------------------------------------------------------------
+async def _thermal_gate_ok() -> bool:
+    """True if it's thermally safe to start a new burn-in.
+    Checks the peak temperature of drives currently under active burn-in.
+    """
+    try:
+        async with _db() as db:
+            cur = await db.execute("""
+                SELECT MAX(d.temperature_c)
+                FROM drives d
+                JOIN burnin_jobs bj ON bj.drive_id = d.id
+                WHERE bj.state = 'running' AND d.temperature_c IS NOT NULL
+            """)
+            row = await cur.fetchone()
+            max_temp = row[0] if row and row[0] is not None else None
+            return max_temp is None or max_temp < settings.temp_warn_c
+    except Exception:
+        return True  # Never block on error
 async def _run_job(job_id: int) -> None:
     """Acquire semaphore slot, execute all stages, persist final state."""
     assert _semaphore is not None, "burnin.init() not called"
+    # Adaptive thermal gate: wait before competing for a slot if running drives
+    # are already at or above the warning threshold. This prevents layering a
+    # new burn-in on top of a thermally-stressed system. Gives up after 3 min
+    # and proceeds anyway so jobs don't queue indefinitely.
+    for _attempt in range(18):  # 18 × 10 s = 3 min max
+        if await _thermal_gate_ok():
+            break
+        if _attempt == 0:
+            log.info(
+                "Thermal gate: job %d waiting — running drive temps at or above %d°C",
+                job_id, settings.temp_warn_c,
+            )
+        await asyncio.sleep(10)
+    else:
+        log.warning("Thermal gate timed out for job %d — proceeding anyway", job_id)
     async with _semaphore:
         if await _is_cancelled(job_id):
             return
@@ -254,17 +551,34 @@ async def _run_job(job_id: int) -> None:
         success = False
         error_text = None
+        was_cancelled = False
         try:
             success = await _execute_stages(job_id, job_stages, devname, drive_id)
         except asyncio.CancelledError:
-            pass
+            was_cancelled = True
         except Exception as exc:
             error_text = str(exc)
             log.exception("Burn-in raised exception", extra={"job_id": job_id, "devname": devname})
-        if await _is_cancelled(job_id):
+        # If the job has already moved to a terminal state — by cancel_job
+        # ('cancelled') or check_stuck_jobs ('unknown') — leave it alone. The
+        # task may have been cancelled mid-stage; finalizing as 'failed' would
+        # clobber that audit-meaningful terminal state.
+        async with _db() as db:
+            cur = await db.execute("SELECT state FROM burnin_jobs WHERE id=?", (job_id,))
+            cur_row = await cur.fetchone()
+        if cur_row and cur_row[0] != "running":
             return
-        final_state = "passed" if success else "failed"
+        # Cancellation arriving here means the asyncio task was cancelled
+        # by something other than cancel_job/check_stuck_jobs (shutdown,
+        # uvicorn reload, future code paths). The DB still says 'running',
+        # so we have to write *some* terminal state, but classifying the
+        # interrupted job as 'failed' would lie — we don't actually know
+        # whether the underlying SMART/badblocks work passed or not.
+        if was_cancelled:
+            final_state = "unknown"
+        else:
+            final_state = "passed" if success else "failed"
         async with _db() as db:
             await db.execute("PRAGMA journal_mode=WAL")
@@ -464,6 +778,14 @@ async def _stage_smart_test_ssh(job_id: int, devname: str, test_type: str, stage
     # Brief pause to let the test register in smartctl output
     await asyncio.sleep(3)
+    # Throttle log_text appends — every poll on a multi-hour long_smart bloated
+    # log_text to 50+ MB and triggered SQLite "database is locked" because each
+    # COALESCE-then-append rewrites the whole column. Append every ~60s, on the
+    # first poll, and on any state change.
+    LOG_EVERY_N_POLLS = 12
+    poll_count = 0
+    last_state: str | None = None
     # Poll until complete
     while True:
         if await _is_cancelled(job_id):
@@ -482,6 +804,10 @@ async def _stage_smart_test_ssh(job_id: int, devname: str, test_type: str, stage
             await _append_stage_log(job_id, stage_name, f"[poll error] {exc}\n")
             continue
-        await _append_stage_log(job_id, stage_name, progress["output"] + "\n---\n")
+        poll_count += 1
+        state_changed = progress["state"] != last_state
+        last_state = progress["state"]
+        if poll_count == 1 or poll_count % LOG_EVERY_N_POLLS == 0 or state_changed:
+            await _append_stage_log(job_id, stage_name, progress["output"] + "\n---\n")
         if progress["state"] == "running":
@@ -519,15 +845,39 @@ async def _stage_smart_test_ssh(job_id: int, devname: str, test_type: str, stage
         # "unknown" → keep polling
+async def _badblocks_available() -> bool:
+    """Check if badblocks is installed on the remote host (Linux/SCALE only)."""
+    from app import ssh_client
+    try:
+        async with await ssh_client._connect() as conn:
+            result = await conn.run("which badblocks", check=False)
+            return result.returncode == 0
+    except Exception:
+        return False
 async def _stage_surface_validate(job_id: int, devname: str, drive_id: int) -> bool:
     """
-    Surface validation stage.
-    SSH mode: runs badblocks -wsv -b 4096 -p 1 /dev/{devname}.
-    Mock mode: simulated timed progress (no real I/O).
+    Surface validation stage auto-routes to the right implementation:
+    1. SSH configured + badblocks available (TrueNAS SCALE / Linux):
+       runs badblocks -wsv -b 4096 -p 1 /dev/{devname} directly over SSH.
+    2. SSH configured + badblocks NOT available (TrueNAS CORE / FreeBSD):
+       uses TrueNAS REST API disk.wipe FULL job + post-wipe SMART check.
+    3. No SSH:
+       simulated timed progress (dev/mock mode).
     """
     from app import ssh_client
     if ssh_client.is_configured():
-        return await _stage_surface_validate_ssh(job_id, devname, drive_id)
+        if await _badblocks_available():
+            return await _stage_surface_validate_ssh(job_id, devname, drive_id)
+        # TrueNAS CORE/FreeBSD: badblocks not available — use native wipe API
+        await _append_stage_log(
+            job_id, "surface_validate",
+            "[INFO] badblocks not found on host (TrueNAS CORE/FreeBSD) — "
+            "using TrueNAS disk.wipe API (FULL write pass).\n\n"
+        )
+        return await _stage_surface_validate_truenas(job_id, devname, drive_id)
     return await _stage_timed_simulate(job_id, "surface_validate", settings.surface_validate_seconds)
@@ -537,8 +887,11 @@ async def _stage_surface_validate_ssh(job_id: int, devname: str, drive_id: int)
     await _append_stage_log(
         job_id, "surface_validate",
-        f"[START] badblocks -wsv -b 4096 -p 1 /dev/{devname}\n"
-        f"[NOTE] This is a DESTRUCTIVE write test. All data on /dev/{devname} will be overwritten.\n\n"
+        f"[START] badblocks -wsv -b {settings.surface_validate_block_size} "
+        f"-c {settings.surface_validate_block_buffer} "
+        f"-p {settings.surface_validate_passes} /dev/{devname}\n"
+        f"[NOTE] This is a DESTRUCTIVE write test. "
+        f"All data on /dev/{devname} will be overwritten.\n\n"
     )
     def _is_cancelled_sync() -> bool:
@@ -580,14 +933,50 @@ async def _stage_surface_validate_ssh(job_id: int, devname: str, drive_id: int)
     output_lines: list[str] = []
     async with await ssh_client._connect() as conn:
-        cmd = f"badblocks -wsv -b 4096 -p 1 /dev/{devname}"
+        # Wrap in `sh -c 'echo PID:$$; exec ...'` so we get the remote
+        # PID on the first stdout line. asyncssh's proc.kill() sends an
+        # SSH signal request that OpenSSH's sshd ignores by default, so
+        # we need the PID to issue an out-of-band `kill -9` over a fresh
+        # session when we want to abort.
+        #
+        # Block geometry is operator-tunable (Settings → Burn-in):
+        # -b N block size in bytes (settings.surface_validate_block_size)
+        # -c N blocks held per IO (settings.surface_validate_block_buffer)
+        # -p N pass count (settings.surface_validate_passes)
+        # Defaults preserve original behavior (-b 4096 -c 64 -p 1).
+        bb_args = (
+            f"-wsv "
+            f"-b {settings.surface_validate_block_size} "
+            f"-c {settings.surface_validate_block_buffer} "
+            f"-p {settings.surface_validate_passes}"
+        )
+        cmd = (
+            f"sh -c 'echo PID:$$; exec badblocks {bb_args} /dev/{devname}'"
+        )
         async with conn.create_process(cmd) as proc:
             import re as _re
+            pid_seen = False
             async def _drain(stream, is_stderr: bool):
-                nonlocal bad_blocks_total
+                nonlocal bad_blocks_total, pid_seen
                 async for raw in stream:
                     line = raw if isinstance(raw, str) else raw.decode("utf-8", errors="replace")
+                    # First stdout line is "PID:<n>" from the wrapping shell.
+                    # Capture it and don't append it to the user-visible log.
+                    if not is_stderr and not pid_seen and line.startswith("PID:"):
+                        pid_seen = True
+                        try:
+                            _remote_pids[job_id] = int(line[4:].strip())
+                            log.info(
+                                "Captured remote PID %d for job %d (badblocks)",
+                                _remote_pids[job_id], job_id,
+                            )
+                        except ValueError:
+                            pass
+                        continue
                     output_lines.append(line)
                     if is_stderr:
@@ -610,7 +999,7 @@ async def _stage_surface_validate_ssh(job_id: int, devname: str, drive_id: int)
                     # Abort on bad block threshold
                     if bad_blocks_total > settings.bad_block_threshold:
-                        proc.kill()
+                        await _kill_remote_process(job_id)
                         output_lines.append(
                             f"\n[ABORTED] {bad_blocks_total} bad block(s) exceeded "
                             f"threshold ({settings.bad_block_threshold})\n"
@@ -618,7 +1007,7 @@ async def _stage_surface_validate_ssh(job_id: int, devname: str, drive_id: int)
                         return
                     if await _is_cancelled(job_id):
-                        proc.kill()
+                        await _kill_remote_process(job_id)
                         return
             await asyncio.gather(
await asyncio.gather( await asyncio.gather(
@ -626,7 +1015,17 @@ async def _stage_surface_validate_ssh(job_id: int, devname: str, drive_id: int)
_drain(proc.stderr, True), _drain(proc.stderr, True),
return_exceptions=True, return_exceptions=True,
) )
await proc.wait() # Bound proc.wait so a remote process that ignored our kill
# signal (or that we never managed to kill) can't pin this
# task in the semaphore forever. Closing the connection on
# exit will deliver SIGPIPE to the remote on its next write.
try:
await asyncio.wait_for(proc.wait(), timeout=15)
except asyncio.TimeoutError:
log.warning(
"proc.wait() timed out for job %d — abandoning channel",
job_id,
)
# Flush remaining output # Flush remaining output
remainder = "".join(output_lines) remainder = "".join(output_lines)
@@ -655,6 +1054,116 @@ async def _stage_surface_validate_ssh(job_id: int, devname: str, drive_id: int)
     return True
async def _stage_surface_validate_truenas(job_id: int, devname: str, drive_id: int) -> bool:
"""
Surface validation via TrueNAS CORE disk.wipe REST API.
Used on FreeBSD (TrueNAS CORE) where badblocks is unavailable.
Sends a FULL write-zero pass across the entire disk, polls progress,
then runs a post-wipe SMART attribute check to catch reallocated sectors.
"""
from app import ssh_client
await _append_stage_log(
job_id, "surface_validate",
f"[START] TrueNAS disk.wipe FULL — {devname}\n"
f"[NOTE] DESTRUCTIVE: all data on {devname} will be overwritten.\n\n"
)
# Start the wipe job
try:
tn_job_id = await _client.wipe_disk(devname, "FULL")
except Exception as exc:
await _set_stage_error(job_id, "surface_validate", f"Failed to start disk.wipe: {exc}")
return False
await _append_stage_log(
job_id, "surface_validate",
f"[JOB] TrueNAS wipe job started (job_id={tn_job_id})\n"
)
# Poll until complete
log_flush_counter = 0
while True:
if await _is_cancelled(job_id):
try:
await _client.abort_job(tn_job_id)
except Exception:
pass
return False
await asyncio.sleep(POLL_INTERVAL)
try:
job = await _client.get_job(tn_job_id)
except Exception as exc:
log.warning("Wipe job poll failed: %s", exc, extra={"job_id": job_id})
await _append_stage_log(job_id, "surface_validate", f"[poll error] {exc}\n")
continue
if not job:
await _set_stage_error(job_id, "surface_validate", f"Wipe job {tn_job_id} not found")
return False
state = job.get("state", "")
pct = int(job.get("progress", {}).get("percent", 0) or 0)
desc = job.get("progress", {}).get("description", "")
await _update_stage_percent(job_id, "surface_validate", min(pct, 99))
await _recalculate_progress(job_id)
_push_update()
# Log progress description every ~5 polls to avoid DB spam
log_flush_counter += 1
if desc and log_flush_counter % 5 == 0:
await _append_stage_log(job_id, "surface_validate", f"[{pct}%] {desc}\n")
if state == "SUCCESS":
await _update_stage_percent(job_id, "surface_validate", 100)
await _append_stage_log(
job_id, "surface_validate",
f"\n[DONE] Wipe job {tn_job_id} completed successfully.\n"
)
# Post-wipe SMART check — catch any sectors that failed under write stress
if ssh_client.is_configured() and drive_id is not None:
await _append_stage_log(
job_id, "surface_validate",
"[CHECK] Running post-wipe SMART attribute check...\n"
)
try:
attrs = await ssh_client.get_smart_attributes(devname)
await _store_smart_attrs(drive_id, attrs)
if attrs["failures"]:
error = "Post-wipe SMART check: " + "; ".join(attrs["failures"])
await _set_stage_error(job_id, "surface_validate", error)
return False
if attrs["warnings"]:
await _append_stage_log(
job_id, "surface_validate",
"[WARNING] " + "; ".join(attrs["warnings"]) + "\n"
)
await _append_stage_log(
job_id, "surface_validate",
f"[CHECK] SMART health: {attrs['health']} — no critical attributes.\n"
)
except Exception as exc:
log.warning("Post-wipe SMART check failed: %s", exc)
await _append_stage_log(
job_id, "surface_validate",
f"[WARN] Post-wipe SMART check failed (non-fatal): {exc}\n"
)
return True
elif state in ("FAILED", "ABORTED", "ERROR"):
error_msg = job.get("error") or f"Disk wipe failed (state={state})"
await _set_stage_error(
job_id, "surface_validate",
f"TrueNAS disk.wipe FAILED: {error_msg}"
)
return False
# RUNNING or WAITING — keep polling
async def _stage_timed_simulate(job_id: int, stage_name: str, duration_seconds: int) -> bool: async def _stage_timed_simulate(job_id: int, stage_name: str, duration_seconds: int) -> bool:
"""Simulate a timed stage with progress updates (mock / dev mode).""" """Simulate a timed stage with progress updates (mock / dev mode)."""
start = time.monotonic() start = time.monotonic()
@ -681,21 +1190,47 @@ async def _stage_final_check(job_id: int, devname: str, drive_id: int | None = N
Verify drive passed all tests. Verify drive passed all tests.
SSH mode: run smartctl -a and check critical attributes. SSH mode: run smartctl -a and check critical attributes.
Mock mode: check SMART health field in DB. Mock mode: check SMART health field in DB.
A transient SSH connectivity failure here must NOT invalidate a prior
multi-day surface_validate. Retry SSH-only failures, then soft-pass.
""" """
await asyncio.sleep(1) await asyncio.sleep(1)
from app import ssh_client from app import ssh_client
def _ssh_only(failures: list[str]) -> bool:
return bool(failures) and all(f.startswith("SSH error:") for f in failures)
if ssh_client.is_configured() and drive_id is not None: if ssh_client.is_configured() and drive_id is not None:
try: try:
attrs = await ssh_client.get_smart_attributes(devname) attrs = await ssh_client.get_smart_attributes(devname)
for attempt in range(2):
if not _ssh_only(attrs.get("failures") or []):
break
log.warning(
"final_check SSH unreachable (attempt %d/3); retrying in 30s",
attempt + 1,
extra={"job_id": job_id, "devname": devname},
)
await asyncio.sleep(30)
attrs = await ssh_client.get_smart_attributes(devname)
failures = attrs.get("failures") or []
if _ssh_only(failures):
log.warning(
"final_check soft-pass: SSH unreachable after retries; prior stages stand",
extra={"job_id": job_id, "devname": devname, "ssh_error": failures},
)
return True
await _store_smart_attrs(drive_id, attrs) await _store_smart_attrs(drive_id, attrs)
if attrs["health"] == "FAILED" or attrs["failures"]: if attrs["health"] == "FAILED" or failures:
failures = attrs["failures"] or [f"SMART health: {attrs['health']}"] msg = failures or [f"SMART health: {attrs['health']}"]
await _set_stage_error(job_id, "final_check", await _set_stage_error(job_id, "final_check",
"Final check failed: " + "; ".join(failures)) "Final check failed: " + "; ".join(msg))
return False return False
return True return True
except Exception as exc: except Exception as exc:
log.warning("SSH final_check failed, falling back to DB check: %s", exc) log.warning("SSH final_check raised, falling back to DB check: %s", exc)
# DB check (mock mode fallback) # DB check (mock mode fallback)
async with _db() as db: async with _db() as db:
@ -942,6 +1477,11 @@ async def check_stuck_jobs() -> None:
"UPDATE burnin_jobs SET state='unknown', finished_at=? WHERE id=?", "UPDATE burnin_jobs SET state='unknown', finished_at=? WHERE id=?",
(now, job_id), (now, job_id),
) )
await db.execute(
"""UPDATE burnin_stages SET state='unknown', finished_at=?
WHERE burnin_job_id=? AND state='running'""",
(now, job_id),
)
await db.execute( await db.execute(
"""INSERT INTO audit_events (event_type, drive_id, burnin_job_id, operator, message) """INSERT INTO audit_events (event_type, drive_id, burnin_job_id, operator, message)
VALUES (?,?,?,?,?)""", VALUES (?,?,?,?,?)""",
@ -951,5 +1491,16 @@ async def check_stuck_jobs() -> None:
await db.commit() await db.commit()
# Actually unstick the running tasks so they release their semaphore slot.
# Without this the DB state becomes 'unknown' but the asyncio task keeps
# holding the slot forever — which is the bug that left subsequent jobs
# permanently 'queued' until container restart.
for row in stuck:
job_id = row[0]
await _kill_remote_process(job_id)
task = _active_tasks.get(job_id)
if task and not task.done():
task.cancel()
_push_update() _push_update()
log.warning("Marked %d stuck job(s) as unknown", len(stuck)) log.warning("Marked %d stuck job(s) as unknown", len(stuck))

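Reviewer note: the bounded `proc.wait()` above is what lets a job release its semaphore slot even when the remote process never exits. A minimal sketch of that pattern follows. `FakeProc`, `bounded_wait`, and `run_job` are hypothetical names used only for illustration; they are not part of this commit, and the real code waits on an asyncssh process rather than a stand-in.

```python
import asyncio


class FakeProc:
    """Hypothetical stand-in for a remote process whose wait() never returns."""

    async def wait(self) -> None:
        await asyncio.Event().wait()  # blocks forever, like an unkilled badblocks


async def bounded_wait(proc, timeout: float) -> bool:
    """Return True if the process exited in time, False if we gave up waiting."""
    try:
        await asyncio.wait_for(proc.wait(), timeout=timeout)
        return True
    except asyncio.TimeoutError:
        return False


async def run_job(sem: asyncio.Semaphore, proc, timeout: float) -> bool:
    # The slot is released when this block exits, whether or not the wait timed out.
    async with sem:
        return await bounded_wait(proc, timeout)


async def demo():
    sem = asyncio.Semaphore(1)
    exited = await run_job(sem, FakeProc(), timeout=0.05)
    return exited, sem.locked()


result = asyncio.run(demo())
```

With an unbounded `await proc.wait()` the `async with sem` block would never exit, which matches the "permanently queued" symptom described in the commit message.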
@ -58,6 +58,21 @@ class Settings(BaseSettings):
# Bad-block tolerance — surface_validate fails if bad blocks exceed this # Bad-block tolerance — surface_validate fails if bad blocks exceed this
bad_block_threshold: int = 0 bad_block_threshold: int = 0
# Surface-validate (badblocks) tunables — defaults match the Spearfoot
# disk-burnin.sh community script's recommended geometry for large HDDs.
# block_size : -b in bytes; aligned to AF (4 KiB) sectors. Bumping
# to 8192 roughly halves badblocks runtime on multi-TB
# drives at the cost of ~2x RAM in the test buffer.
# block_buffer : -c blocks held in memory per IO. 64 = badblocks
# default. Higher values = larger buffer, faster IO,
# more RAM (block_size * block_buffer bytes per pass).
# passes : -p value. 1 = repeat until one consecutive clean
# scan (current behavior). 2-3 for paranoid burn-in
# that re-confirms after finding errors.
surface_validate_block_size: int = 4096
surface_validate_block_buffer: int = 64
surface_validate_passes: int = 1
# SSH credentials for direct TrueNAS command execution (Stage 7) # SSH credentials for direct TrueNAS command execution (Stage 7)
# When ssh_host is set, burn-in stages use SSH for smartctl/badblocks instead of REST API. # When ssh_host is set, burn-in stages use SSH for smartctl/badblocks instead of REST API.
# Leave ssh_host empty to use the mock/REST API (development mode). # Leave ssh_host empty to use the mock/REST API (development mode).
@ -68,7 +83,7 @@ class Settings(BaseSettings):
ssh_key: str = "" # PEM private key content (paste full key including headers) ssh_key: str = "" # PEM private key content (paste full key including headers)
# Application version — used by the /api/v1/updates/check endpoint # Application version — used by the /api/v1/updates/check endpoint
app_version: str = "1.0.0-7" app_version: str = "1.0.0-21"
settings = Settings() settings = Settings()

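Reviewer note: the tunables comment states that the test buffer is `block_size * block_buffer` bytes. The sketch below works that arithmetic through and shows how the three settings could map onto the `-b`, `-c`, and `-p` flags named in the comment. Both helper names are hypothetical, and how the commit actually assembles the badblocks command line is not shown in this diff.

```python
def badblocks_buffer_bytes(block_size: int, block_buffer: int) -> int:
    """Bytes held in the test buffer, per the formula in the settings comment."""
    return block_size * block_buffer


def badblocks_tunable_args(block_size: int, block_buffer: int, passes: int) -> list[str]:
    """Hypothetical helper: render the three tunables as badblocks flags."""
    return ["-b", str(block_size), "-c", str(block_buffer), "-p", str(passes)]


default_buffer = badblocks_buffer_bytes(4096, 64)   # 262144 bytes = 256 KiB
doubled_buffer = badblocks_buffer_bytes(8192, 64)   # 524288 bytes = 512 KiB
default_args = badblocks_tunable_args(4096, 64, 1)
```

Doubling `block_size` while leaving `block_buffer` at 64 doubles the buffer, which is the "~2x RAM" trade-off the comment mentions.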
@ -89,6 +89,16 @@ _MIGRATIONS = [
"ALTER TABLE smart_tests ADD COLUMN raw_output TEXT", "ALTER TABLE smart_tests ADD COLUMN raw_output TEXT",
# Stage 8: track last reset time so dashboard burn-in col clears after reset # Stage 8: track last reset time so dashboard burn-in col clears after reset
"ALTER TABLE drives ADD COLUMN last_reset_at TEXT", "ALTER TABLE drives ADD COLUMN last_reset_at TEXT",
# 1.0.0-15: pool-membership lock
"ALTER TABLE drives ADD COLUMN pool_name TEXT",
"ALTER TABLE drives ADD COLUMN pool_role TEXT",
"ALTER TABLE drives ADD COLUMN pool_seen_at TEXT",
# 1.0.0-19: enforce one active burn-in per drive at the storage layer.
# Closes the read-then-insert race in burnin.start_job — without this,
# two concurrent /api/v1/burnin/start requests for the same drive could
# both observe zero active jobs and both insert queued rows.
"""CREATE UNIQUE INDEX IF NOT EXISTS uniq_active_burnin_per_drive
ON burnin_jobs (drive_id) WHERE state IN ('queued', 'running')""",
] ]

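Reviewer note: the partial unique index added in the 1.0.0-19 migration can be exercised in isolation. The sketch below uses an in-memory SQLite database and a deliberately simplified `burnin_jobs` table (only the columns the index needs, which is an assumption); `try_start` is a hypothetical helper, not the project's `burnin.start_job`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified schema for illustration; the real table has more columns.
conn.execute(
    "CREATE TABLE burnin_jobs (id INTEGER PRIMARY KEY, drive_id INTEGER, state TEXT)"
)
conn.execute(
    """CREATE UNIQUE INDEX IF NOT EXISTS uniq_active_burnin_per_drive
       ON burnin_jobs (drive_id) WHERE state IN ('queued', 'running')"""
)


def try_start(drive_id: int) -> bool:
    """Hypothetical helper: insert a queued job, report whether the index allowed it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO burnin_jobs (drive_id, state) VALUES (?, 'queued')",
                (drive_id,),
            )
        return True
    except sqlite3.IntegrityError:
        return False


first = try_start(1)    # no active job yet
second = try_start(1)   # rejected: drive 1 already has a queued job
with conn:
    conn.execute("UPDATE burnin_jobs SET state = 'passed' WHERE drive_id = 1")
third = try_start(1)    # allowed again: the finished row is outside the index
```

Because the uniqueness check happens inside the INSERT itself, the second of two racing requests fails at the storage layer regardless of what either request read beforehand.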
@ -5,6 +5,7 @@ Disabled when SMTP_HOST is not set.
""" """
import asyncio import asyncio
import html
import logging import logging
import smtplib import smtplib
import ssl import ssl
@ -109,17 +110,61 @@ def _drive_rows_html(drives: list[dict]) -> str:
return "\n".join(rows) return "\n".join(rows)
def _build_html(drives: list[dict], generated_at: str) -> str: def _build_unlock_banner_html(events: list[dict]) -> str:
"""Banner listing every pool-drive unlock granted in the last 24h.
Every interpolated DB field is run through html.escape: operator and
reason are free text from the unlock modal and would otherwise be
injected into the email body verbatim.
"""
if not events:
return ""
rows = []
for e in events:
evt = e.get("event_type") or ""
is_boot = evt == "boot_pool_drive_unlocked"
is_exported = evt == "exported_pool_drive_unlocked"
kind = (
"BOOT POOL" if is_boot
else "EXPORTED ZFS" if is_exported
else "pool"
)
when = html.escape((e.get("created_at") or "")[:19])
operator = html.escape(e.get("operator") or "?")
devname = html.escape(e.get("devname") or "?")
# `message` already includes pool name, devname, and the operator's
# reason — surface it verbatim so the audit trail is faithful.
message = html.escape(e.get("message") or "")
rows.append(
f"<li style='margin:4px 0'><strong>{when}</strong> &middot; "
f"<strong>{operator}</strong> unlocked a {kind} drive "
f"({devname}): "
f"<span style='color:#c9d1d9'>{message}</span></li>"
)
return f"""
<div style="background:#4b1113;border:1px solid #f85149;border-radius:6px;
padding:14px 18px;margin-bottom:20px;color:#f85149">
<div style="font-weight:600;font-size:14px;margin-bottom:6px">
&#x26A0; {len(events)} pool-drive unlock(s) in the last 24h
</div>
<ul style="margin:0;padding-left:18px;font-size:12.5px;color:#f0a0a0">
{''.join(rows)}
</ul>
</div>"""
def _build_html(drives: list[dict], generated_at: str,
unlock_events: list[dict] | None = None) -> str:
total = len(drives) total = len(drives)
failed_drives = [d for d in drives if d.get("smart_health") == "FAILED"] failed_drives = [d for d in drives if d.get("smart_health") == "FAILED"]
running_burnin = [d for d in drives if (d.get("burnin") or {}).get("state") == "running"] running_burnin = [d for d in drives if (d.get("burnin") or {}).get("state") == "running"]
passed_burnin = [d for d in drives if (d.get("burnin") or {}).get("state") == "passed"] passed_burnin = [d for d in drives if (d.get("burnin") or {}).get("state") == "passed"]
# Alert banner # Alert banners (unlock events first — the audit-grade signal)
alert_html = "" alert_html = _build_unlock_banner_html(unlock_events or [])
if failed_drives: if failed_drives:
names = ", ".join(d["devname"] for d in failed_drives) names = ", ".join(d["devname"] for d in failed_drives)
alert_html = f""" alert_html += f"""
<div style="background:#4b1113;border:1px solid #f85149;border-radius:6px;padding:14px 18px;margin-bottom:20px;color:#f85149;font-weight:500"> <div style="background:#4b1113;border:1px solid #f85149;border-radius:6px;padding:14px 18px;margin-bottom:20px;color:#f85149;font-weight:500">
SMART health FAILED on {len(failed_drives)} drive(s): {names} SMART health FAILED on {len(failed_drives)} drive(s): {names}
</div>""" </div>"""
@ -287,6 +332,36 @@ async def _fetch_report_data() -> list[dict]:
return await _fetch_drives_for_template(db) return await _fetch_drives_for_template(db)
async def _fetch_unlock_events_24h() -> list[dict]:
"""Return pool-drive unlock audit events from the last 24 hours.
These are operator overrides of the pool-membership lock; every entry
represents a deliberate decision to risk a pool, so the daily report
surfaces them as an audit-grade banner.
"""
async with aiosqlite.connect(settings.db_path) as db:
db.row_factory = aiosqlite.Row
await db.execute("PRAGMA journal_mode=WAL")
# julianday() handles the 'YYYY-MM-DDTHH:MM:SS.fff+00:00' format
# we write from Python; comparing the raw string against
# datetime('now','-1 day') (which formats as 'YYYY-MM-DD HH:MM:SS')
# produces subtle off-by-up-to-a-day errors because of the
# 'T' vs ' ' separator and the '+00:00' suffix.
cur = await db.execute("""
SELECT ae.event_type, ae.operator, ae.message, ae.created_at,
d.devname, d.pool_name, d.pool_role
FROM audit_events ae
LEFT JOIN drives d ON d.id = ae.drive_id
WHERE ae.event_type IN (
'pool_drive_unlocked',
'boot_pool_drive_unlocked',
'exported_pool_drive_unlocked')
AND julianday(ae.created_at) >= julianday('now', '-1 day')
ORDER BY ae.created_at DESC
""")
return [dict(r) for r in await cur.fetchall()]
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
# Scheduler # Scheduler
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
@ -411,9 +486,16 @@ async def test_smtp_connection() -> dict:
async def send_report_now() -> None: async def send_report_now() -> None:
"""Send a report immediately (used by on-demand API endpoint).""" """Send a report immediately (used by on-demand API endpoint)."""
drives = await _fetch_report_data() drives = await _fetch_report_data()
unlock_events = await _fetch_unlock_events_24h()
now_str = datetime.now().strftime("%Y-%m-%d %H:%M") now_str = datetime.now().strftime("%Y-%m-%d %H:%M")
html = _build_html(drives, now_str) html = _build_html(drives, now_str, unlock_events)
subject = f"Burn-In Report — {datetime.now().strftime('%Y-%m-%d')} ({len(drives)} drives)" suffix = ""
if unlock_events:
suffix = f" [{len(unlock_events)} pool unlock(s)]"
subject = (
f"Burn-In Report — {datetime.now().strftime('%Y-%m-%d')} "
f"({len(drives)} drives){suffix}"
)
await asyncio.to_thread(_send_email, subject, html) await asyncio.to_thread(_send_email, subject, html)

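Reviewer note: the comment in `_fetch_unlock_events_24h` explains why `julianday()` is used instead of a raw string comparison. The sketch below reproduces the difference with a fixed reference time instead of `'now'` so the result is deterministic. The timestamp values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# An event written in the Python ISO format, 36 hours before the reference time.
created_at = "2024-01-01T00:30:00.000+00:00"
reference_now = "2024-01-02 12:00:00"

string_cmp, julian_cmp = conn.execute(
    """SELECT ? >= datetime(?, '-1 day'),
              julianday(?) >= julianday(?, '-1 day')""",
    (created_at, reference_now, created_at, reference_now),
).fetchone()

# string_cmp: 'T' sorts after ' ', so the text comparison wrongly keeps the event.
# julian_cmp: both sides are parsed as instants, so the 36-hour-old event is excluded.
```

The text comparison diverges at the eleventh character, where `T` is compared against a space, so every event from the cutoff's calendar date is treated as recent regardless of its time of day.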
@ -97,8 +97,17 @@ class DriveResponse(BaseModel):
smart_long: SmartTestState smart_long: SmartTestState
notes: str | None = None notes: str | None = None
location: str | None = None location: str | None = None
pool_name: str | None = None
pool_role: str | None = None
pool_unlocked_until: float | None = None # unix epoch; null = locked
class UpdateDriveRequest(BaseModel): class UpdateDriveRequest(BaseModel):
notes: str | None = None notes: str | None = None
location: str | None = None location: str | None = None
class UnlockPoolDriveRequest(BaseModel):
confirm_token: str
operator: str
reason: str

@ -20,13 +20,15 @@ from app.truenas import TrueNASClient
log = logging.getLogger(__name__) log = logging.getLogger(__name__)
# Shared state read by the /health endpoint # Shared state read by the /health endpoint and dashboard template
_state: dict[str, Any] = { _state: dict[str, Any] = {
"last_poll_at": None, "last_poll_at": None,
"last_error": None, "last_error": None,
"healthy": False, "healthy": False,
"drives_seen": 0, "drives_seen": 0,
"consecutive_failures": 0, "consecutive_failures": 0,
"system_temps": {}, # {"cpu_c": int|None, "pch_c": int|None}
"thermal_pressure": "ok", # "ok" | "warn" | "crit" — based on running burn-in drive temps
} }
# SSE subscriber queues — notified after each successful poll # SSE subscriber queues — notified after each successful poll
@ -87,18 +89,60 @@ def _map_history_state(status: str) -> str:
# DB helpers # DB helpers
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
async def _upsert_drive(db: aiosqlite.Connection, disk: dict, now: str) -> int: async def _upsert_drive(db: aiosqlite.Connection, disk: dict, now: str,
await db.execute( pool_info: dict | None = None,
update_pool: bool = True) -> int:
"""Insert/update a drive row.
pool_info: {"pool": str, "role": str} if this drive is currently in a
zpool, else None. None values clear pool columns so a removed-from-pool
drive doesn't stay locked.
update_pool: when False, pool columns are preserved on conflict and
initialised to NULL on insert. Callers pass False on detection failure
so a transient SSH outage doesn't silently unlock every drive.
""" """
INSERT INTO drives pool_name = pool_info["pool"] if pool_info else None
(truenas_disk_id, devname, serial, model, size_bytes, pool_role = pool_info["role"] if pool_info else None
temperature_c, smart_health, last_seen_at, last_polled_at) pool_seen_at = now if pool_info else None
VALUES (?,?,?,?,?,?,?,?,?)
ON CONFLICT(truenas_disk_id) DO UPDATE SET if update_pool:
update_clause = """
devname = excluded.devname,
serial = excluded.serial,
model = excluded.model,
size_bytes = excluded.size_bytes,
temperature_c = excluded.temperature_c,
smart_health = excluded.smart_health,
last_seen_at = excluded.last_seen_at,
last_polled_at = excluded.last_polled_at,
pool_name = excluded.pool_name,
pool_role = excluded.pool_role,
pool_seen_at = excluded.pool_seen_at
"""
else:
# Preserve pool_name / pool_role / pool_seen_at — detection failed
# this cycle, so we have no fresh data and must not overwrite.
update_clause = """
devname = excluded.devname,
serial = excluded.serial,
model = excluded.model,
size_bytes = excluded.size_bytes,
temperature_c = excluded.temperature_c, temperature_c = excluded.temperature_c,
smart_health = excluded.smart_health, smart_health = excluded.smart_health,
last_seen_at = excluded.last_seen_at, last_seen_at = excluded.last_seen_at,
last_polled_at = excluded.last_polled_at last_polled_at = excluded.last_polled_at
"""
await db.execute(
f"""
INSERT INTO drives
(truenas_disk_id, devname, serial, model, size_bytes,
temperature_c, smart_health, last_seen_at, last_polled_at,
pool_name, pool_role, pool_seen_at)
VALUES (?,?,?,?,?,?,?,?,?,?,?,?)
ON CONFLICT(truenas_disk_id) DO UPDATE SET
{update_clause}
""", """,
( (
disk["identifier"], disk["identifier"],
@ -110,6 +154,9 @@ async def _upsert_drive(db: aiosqlite.Connection, disk: dict, now: str) -> int:
disk.get("smart_health", "UNKNOWN"), disk.get("smart_health", "UNKNOWN"),
now, now,
now, now,
pool_name,
pool_role,
pool_seen_at,
), ),
) )
cur = await db.execute( cur = await db.execute(
@ -208,6 +255,67 @@ async def _sync_history(
# Poll cycle # Poll cycle
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
async def _poll_smart_via_ssh(db: aiosqlite.Connection, now: str) -> None:
"""
Poll progress for SMART tests started via SSH (truenas_job_id IS NULL).
Used on TrueNAS SCALE 25.10+ where the REST smart/test API no longer exists.
"""
from app import ssh_client
if not ssh_client.is_configured():
return
cur = await db.execute(
"""SELECT st.id, st.test_type, st.drive_id, d.devname, st.started_at
FROM smart_tests st
JOIN drives d ON d.id = st.drive_id
WHERE st.state = 'running' AND st.truenas_job_id IS NULL"""
)
rows = await cur.fetchall()
if not rows:
return
for row in rows:
test_id, ttype, drive_id, devname, started_at = row[0], row[1], row[2], row[3], row[4]
try:
progress = await ssh_client.poll_smart_progress(devname)
except Exception as exc:
log.warning("SSH SMART poll failed for %s: %s", devname, exc)
continue
state = progress["state"]
pct_remaining = progress.get("percent_remaining") # None = not yet in output
raw_output = progress.get("output", "")
if state == "running":
# pct_remaining=None means smartctl output doesn't have the % line yet
# (test just started) — keep percent at 0 rather than jumping to 100
if pct_remaining is None:
pct = 0
else:
pct = max(0, 100 - pct_remaining)
eta = _eta_from_progress(pct, started_at)
await db.execute(
"UPDATE smart_tests SET percent=?, eta_at=?, raw_output=? WHERE id=?",
(pct, eta, raw_output, test_id),
)
elif state == "passed":
await db.execute(
"UPDATE smart_tests SET state='passed', percent=100, finished_at=?, raw_output=? WHERE id=?",
(now, raw_output, test_id),
)
log.info("SSH SMART %s passed on %s", ttype, devname)
elif state == "failed":
await db.execute(
"UPDATE smart_tests SET state='failed', percent=0, finished_at=?, "
"error_text=?, raw_output=? WHERE id=?",
(now, f"SMART {ttype.upper()} test failed", raw_output, test_id),
)
log.warning("SSH SMART %s FAILED on %s", ttype, devname)
# state == "unknown" → keep polling, no update
await db.commit()
async def poll_cycle(client: TrueNASClient) -> int: async def poll_cycle(client: TrueNASClient) -> int:
"""Run one full poll. Returns number of drives seen.""" """Run one full poll. Returns number of drives seen."""
now = _now() now = _now()
@ -215,6 +323,88 @@ async def poll_cycle(client: TrueNASClient) -> int:
disks = await client.get_disks() disks = await client.get_disks()
running_jobs = await client.get_smart_jobs(state="RUNNING") running_jobs = await client.get_smart_jobs(state="RUNNING")
# Fetch temperatures via SCALE-specific endpoint.
# CORE doesn't have this endpoint — silently skip on any error.
try:
temps = await client.get_disk_temperatures()
except Exception:
temps = {}
# Inject temperature into each disk dict (SCALE 25.10 has no temp in /disk)
for disk in disks:
devname = disk.get("devname", "")
t = temps.get(devname)
if t is not None:
disk["temperature"] = int(round(t))
# SMART health — TrueNAS /api/v2.0/disk doesn't expose smart_health,
# so without this every drive defaults to UNKNOWN forever (only burn-in
# stages used to populate it). Run `smartctl -H` over a single SSH
# session for every drive every Nth cycle. Cache between cycles via
# _state so the dashboard always renders the most recent answer.
SMART_HEALTH_EVERY_N_CYCLES = 5 # ~1 minute at default 12s interval
_state.setdefault("smart_health_cache", {})
cycle_n = _state.get("cycle", 0) + 1
_state["cycle"] = cycle_n
try:
from app import ssh_client as _ssh
if _ssh.is_configured() and (cycle_n % SMART_HEALTH_EVERY_N_CYCLES == 1):
health_map = await _ssh.get_smart_health_map(
[d["devname"] for d in disks if d.get("devname")]
)
if health_map is not None:
_state["smart_health_cache"] = health_map
except Exception as exc:
log.warning("smart_health refresh failed: %s", exc)
health_cache = _state.get("smart_health_cache") or {}
for disk in disks:
devname = disk.get("devname", "")
h = health_cache.get(devname)
if h:
disk["smart_health"] = h
# Pool membership map — drives in any zpool are locked from burn-in.
# ssh_client returns None on failure (distinct from {} which means "no
# pools"). If EITHER detection call fails we fail-closed: leave
# pool_name / pool_role columns alone so previously-locked drives stay
# locked, and previously-unlocked drives stay unlocked, until detection
# recovers. Treating a transient SSH blip as "no pool members" would
# silently unlock every drive on the next poll.
detection_ok = True
pool_map: dict = {}
zfs_member_set: set = set()
try:
from app import ssh_client as _ssh
if _ssh.is_configured():
pm = await _ssh.get_pool_membership()
zs = await _ssh.get_zfs_member_drives()
if pm is None or zs is None:
detection_ok = False
else:
pool_map = pm
zfs_member_set = zs
# SSH unconfigured (mock/dev mode) — detection_ok stays True with
# empty maps, so dev mode never artificially locks drives.
except Exception:
detection_ok = False
if not detection_ok:
log.warning(
"Pool detection failed this cycle — preserving existing "
"pool_name/pool_role columns. Locked drives stay locked, "
"unlocked drives stay unlocked, until SSH recovers."
)
if detection_ok:
# Drives carrying ZFS labels but not in any active pool are
# "exported" — same hazard as an active pool member, so lock them
# too. We don't know the original pool name without
# `zpool import`-style scanning (slow + blocks); display
# "(exported)" and use a special token.
for devname in zfs_member_set:
if devname not in pool_map:
pool_map[devname] = {"pool": "(exported)", "role": "exported"}
# Index running jobs by (devname, test_type) # Index running jobs by (devname, test_type)
active: dict[tuple[str, str], dict] = {} active: dict[tuple[str, str], dict] = {}
for job in running_jobs: for job in running_jobs:
@ -233,7 +423,11 @@ async def poll_cycle(client: TrueNASClient) -> int:
for disk in disks: for disk in disks:
devname = disk["devname"] devname = disk["devname"]
drive_id = await _upsert_drive(db, disk, now) drive_id = await _upsert_drive(
db, disk, now,
pool_map.get(devname) if detection_ok else None,
update_pool=detection_ok,
)
for ttype in ("short", "long"): for ttype in ("short", "long"):
if (devname, ttype) in active: if (devname, ttype) in active:
@ -243,6 +437,9 @@ async def poll_cycle(client: TrueNASClient) -> int:
await db.commit() await db.commit()
# SSH SMART polling — for tests started via smartctl (no TrueNAS REST job)
await _poll_smart_via_ssh(db, now)
return len(disks) return len(disks)
@ -263,6 +460,39 @@ async def run(client: TrueNASClient) -> None:
_state["drives_seen"] = count _state["drives_seen"] = count
_state["consecutive_failures"] = 0 _state["consecutive_failures"] = 0
log.debug("Poll OK", extra={"drives": count}) log.debug("Poll OK", extra={"drives": count})
# System sensor temps via SSH (non-fatal)
from app import ssh_client as _ssh
if _ssh.is_configured():
try:
_state["system_temps"] = await _ssh.get_system_sensors()
except Exception:
pass
# Thermal pressure: max temp of drives currently under burn-in
try:
async with aiosqlite.connect(settings.db_path) as _tdb:
_tdb.row_factory = aiosqlite.Row
await _tdb.execute("PRAGMA journal_mode=WAL")
_cur = await _tdb.execute("""
SELECT MAX(d.temperature_c)
FROM drives d
JOIN burnin_jobs bj ON bj.drive_id = d.id
WHERE bj.state = 'running' AND d.temperature_c IS NOT NULL
""")
_row = await _cur.fetchone()
_max_t = _row[0] if _row and _row[0] is not None else None
if _max_t is None:
_state["thermal_pressure"] = "ok"
elif _max_t >= settings.temp_crit_c:
_state["thermal_pressure"] = "crit"
elif _max_t >= settings.temp_warn_c:
_state["thermal_pressure"] = "warn"
else:
_state["thermal_pressure"] = "ok"
except Exception:
_state["thermal_pressure"] = "ok"
_notify_subscribers() _notify_subscribers()
# Check for stuck jobs every 5 cycles (~1 min at default 12s interval) # Check for stuck jobs every 5 cycles (~1 min at default 12s interval)

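Reviewer note: the fail-closed pool detection in `poll_cycle` reduces to a small pure function, which may be easier to reason about and unit-test than the inline version. The sketch below is an illustration under that assumption; `resolve_pool_state` is a hypothetical name and does not exist in this commit.

```python
from typing import Optional


def resolve_pool_state(
    pm: Optional[dict], zs: Optional[set]
) -> tuple[bool, dict]:
    """Return (detection_ok, pool_map).

    None from either detection call means "detection failed", which is distinct
    from an empty result meaning "no pools". On failure the caller is expected
    to leave the existing pool columns untouched.
    """
    if pm is None or zs is None:
        return False, {}
    pool_map = dict(pm)
    # Drives with ZFS labels but no active pool are treated as exported members.
    for devname in zs:
        if devname not in pool_map:
            pool_map[devname] = {"pool": "(exported)", "role": "exported"}
    return True, pool_map


failed = resolve_pool_state(None, {"sdb"})
healthy = resolve_pool_state({"sda": {"pool": "tank", "role": "data"}}, {"sda", "sdb"})
empty = resolve_pool_state({}, set())
```

The `empty` case returns `(True, {})`, so a system with no pools still clears stale locks, while the `failed` case signals the caller to pass `update_pool=False`.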
@ -15,7 +15,8 @@ from app.database import get_db
from app.models import ( from app.models import (
BurninJobResponse, BurninStageResponse, BurninJobResponse, BurninStageResponse,
CancelBurninRequest, DriveResponse, CancelBurninRequest, DriveResponse,
SmartTestState, StartBurninRequest, UpdateDriveRequest, SmartTestState, StartBurninRequest, UnlockPoolDriveRequest,
UpdateDriveRequest,
) )
from app.renderer import templates from app.renderer import templates
@ -48,6 +49,22 @@ def _is_stale(last_polled_at: str) -> bool:
return True return True
def _compute_eta_seconds(started_at: str | None, percent: int) -> int | None:
"""Linear ETA extrapolation from started_at and percent complete."""
if not started_at or percent <= 0:
return None
try:
start = datetime.fromisoformat(started_at)
if start.tzinfo is None:
start = start.replace(tzinfo=timezone.utc)
elapsed = (datetime.now(timezone.utc) - start).total_seconds()
total_est = elapsed / (percent / 100)
remaining = max(0, int(total_est - elapsed))
return remaining
except Exception:
return None
def _build_smart(row: aiosqlite.Row, prefix: str) -> SmartTestState: def _build_smart(row: aiosqlite.Row, prefix: str) -> SmartTestState:
eta_at = row[f"{prefix}_eta_at"] eta_at = row[f"{prefix}_eta_at"]
return SmartTestState( return SmartTestState(
@ -76,6 +93,11 @@ def _row_to_drive(row: aiosqlite.Row) -> DriveResponse:
smart_long=_build_smart(row, "long"), smart_long=_build_smart(row, "long"),
notes=row["notes"], notes=row["notes"],
location=row["location"], location=row["location"],
pool_name=row["pool_name"],
pool_role=row["pool_role"],
pool_unlocked_until=burnin.unlock_expiry(
row["id"], row["pool_name"], row["pool_role"],
),
) )
@ -96,7 +118,7 @@ _DRIVES_QUERY = """
SELECT SELECT
d.id, d.devname, d.serial, d.model, d.size_bytes, d.id, d.devname, d.serial, d.model, d.size_bytes,
d.temperature_c, d.smart_health, d.last_polled_at, d.temperature_c, d.smart_health, d.last_polled_at,
d.notes, d.location, d.notes, d.location, d.pool_name, d.pool_role,
s.state AS short_state, s.state AS short_state,
s.percent AS short_percent, s.percent AS short_percent,
s.started_at AS short_started_at, s.started_at AS short_started_at,
@ -112,6 +134,7 @@ _DRIVES_QUERY = """
FROM drives d FROM drives d
LEFT JOIN smart_tests s ON s.drive_id = d.id AND s.test_type = 'short' LEFT JOIN smart_tests s ON s.drive_id = d.id AND s.test_type = 'short'
LEFT JOIN smart_tests l ON l.drive_id = d.id AND l.test_type = 'long' LEFT JOIN smart_tests l ON l.drive_id = d.id AND l.test_type = 'long'
WHERE julianday(d.last_seen_at) >= julianday('now', '-7 days')
{where} {where}
ORDER BY d.devname ORDER BY d.devname
""" """
@ -138,11 +161,55 @@ async def _fetch_drives_for_template(db: aiosqlite.Connection) -> list[dict]:
cur = await db.execute(_DRIVES_QUERY.format(where="")) cur = await db.execute(_DRIVES_QUERY.format(where=""))
rows = await cur.fetchall() rows = await cur.fetchall()
burnin_by_drive = await _fetch_burnin_by_drive(db) burnin_by_drive = await _fetch_burnin_by_drive(db)
# For burn-ins that include SMART stages, fetch those stages so we can
# mirror their progress/result in the Short/Long SMART columns.
# This covers both running stages (showing live progress) and completed
# stages (showing passed/failed after the burn-in moves to the next stage).
bi_smart_stages: dict[int, dict[str, dict]] = {} # job_id -> {stage_name: row}
bi_ids_with_smart = [
bi["id"] for bi in burnin_by_drive.values()
if bi["state"] in ("running", "queued")
]
if bi_ids_with_smart:
placeholders = ",".join("?" * len(bi_ids_with_smart))
cur = await db.execute(f"""
SELECT bs.burnin_job_id, bs.stage_name, bs.state, bs.percent,
bs.started_at, bs.finished_at, bs.error_text
FROM burnin_stages bs
WHERE bs.burnin_job_id IN ({placeholders})
AND bs.stage_name IN ('short_smart', 'long_smart')
AND bs.state IN ('running', 'passed', 'failed')
""", bi_ids_with_smart)
for r in await cur.fetchall():
bi_smart_stages.setdefault(r["burnin_job_id"], {})[r["stage_name"]] = dict(r)
drives = [] drives = []
for row in rows: for row in rows:
d = _row_to_drive(row).model_dump() d = _row_to_drive(row).model_dump()
d["status"] = _compute_status(d) d["status"] = _compute_status(d)
d["burnin"] = burnin_by_drive.get(d["id"]) bi = burnin_by_drive.get(d["id"])
d["burnin"] = bi
# Overlay burn-in SMART stage progress/results onto the SMART columns
if bi and bi["id"] in bi_smart_stages:
for stage_name, stage in bi_smart_stages[bi["id"]].items():
target = "smart_short" if stage_name == "short_smart" else "smart_long"
# Only overlay if the standalone SMART column is idle/empty
existing = d.get(target) or {}
if existing.get("state") not in (None, "idle"):
continue
pct = stage["percent"] or 0
d[target] = {
"state": stage["state"],
"percent": pct if stage["state"] == "running" else (100 if stage["state"] == "passed" else 0),
"eta_seconds": _compute_eta_seconds(stage["started_at"], pct) if stage["state"] == "running" else None,
"eta_timestamp": None,
"started_at": stage["started_at"],
"finished_at": stage["finished_at"],
"error_text": stage["error_text"],
}
drives.append(d) drives.append(d)
return drives return drives
@ -170,7 +237,7 @@ def _stale_context(poller_state: dict) -> dict:
async def dashboard(request: Request, db: aiosqlite.Connection = Depends(get_db)): async def dashboard(request: Request, db: aiosqlite.Connection = Depends(get_db)):
drives = await _fetch_drives_for_template(db) drives = await _fetch_drives_for_template(db)
ps = poller.get_state() ps = poller.get_state()
return templates.TemplateResponse("dashboard.html", { return templates.TemplateResponse(request, "dashboard.html", {
"request": request, "request": request,
"drives": drives, "drives": drives,
"poller": ps, "poller": ps,
@ -218,6 +285,18 @@ async def sse_drives(request: Request):
yield {"event": "drives-update", "data": html} yield {"event": "drives-update", "data": html}
# Push system sensor state so JS can update temp chips live
ps = poller.get_state()
yield {
"event": "system-sensors",
"data": json.dumps({
"system_temps": ps.get("system_temps", {}),
"thermal_pressure": ps.get("thermal_pressure", "ok"),
"temp_warn_c": settings.temp_warn_c,
"temp_crit_c": settings.temp_crit_c,
}),
}
# Push browser notification event if this was a job completion # Push browser notification event if this was a job completion
if alert: if alert:
yield {"event": "job-alert", "data": json.dumps(alert)} yield {"event": "job-alert", "data": json.dumps(alert)}
@ -258,7 +337,7 @@ async def list_drives(db: aiosqlite.Connection = Depends(get_db)):
@router.get("/api/v1/drives/{drive_id}/drawer") @router.get("/api/v1/drives/{drive_id}/drawer")
async def drive_drawer(drive_id: int, db: aiosqlite.Connection = Depends(get_db)): async def drive_drawer(drive_id: int, db: aiosqlite.Connection = Depends(get_db)):
"""Data for the log drawer — latest burn-in job + stages, SMART tests, audit events.""" """Data for the log drawer — latest burn-in job + stages, SMART tests, audit events."""
cur = await db.execute(_DRIVES_QUERY.format(where="WHERE d.id = ?"), (drive_id,)) cur = await db.execute(_DRIVES_QUERY.format(where="AND d.id = ?"), (drive_id,))
row = await cur.fetchone() row = await cur.fetchone()
if not row: if not row:
raise HTTPException(status_code=404, detail="Drive not found") raise HTTPException(status_code=404, detail="Drive not found")
@ -339,7 +418,7 @@ async def drive_drawer(drive_id: int, db: aiosqlite.Connection = Depends(get_db)
@router.get("/api/v1/drives/{drive_id}", response_model=DriveResponse) @router.get("/api/v1/drives/{drive_id}", response_model=DriveResponse)
async def get_drive(drive_id: int, db: aiosqlite.Connection = Depends(get_db)): async def get_drive(drive_id: int, db: aiosqlite.Connection = Depends(get_db)):
cur = await db.execute( cur = await db.execute(
_DRIVES_QUERY.format(where="WHERE d.id = ?"), (drive_id,) _DRIVES_QUERY.format(where="AND d.id = ?"), (drive_id,)
) )
row = await cur.fetchone() row = await cur.fetchone()
if not row: if not row:
@ -353,9 +432,13 @@ async def smart_start(
body: dict, body: dict,
db: aiosqlite.Connection = Depends(get_db), db: aiosqlite.Connection = Depends(get_db),
): ):
"""Start a standalone SHORT or LONG SMART test on a single drive.""" """Start a standalone SHORT or LONG SMART test on a single drive.
from app.truenas import TrueNASClient
from app import burnin as _burnin Uses SSH (smartctl) when configured; required for TrueNAS SCALE 25.10+
where the REST smart/test endpoint no longer exists.
Falls back to TrueNAS REST API for older versions.
"""
from app import burnin as _burnin, ssh_client
test_type = (body.get("type") or "").upper() test_type = (body.get("type") or "").upper()
if test_type not in ("SHORT", "LONG"): if test_type not in ("SHORT", "LONG"):
@ -367,16 +450,41 @@ async def smart_start(
raise HTTPException(status_code=404, detail="Drive not found") raise HTTPException(status_code=404, detail="Drive not found")
devname = row[0] devname = row[0]
# Use the shared TrueNAS client held by the burnin module now = datetime.now(timezone.utc).isoformat()
ttype_lower = test_type.lower()
if ssh_client.is_configured():
# SSH path — works on TrueNAS SCALE 25.10+ and CORE
try:
output = await ssh_client.start_smart_test(devname, test_type)
except Exception as exc:
raise HTTPException(status_code=502, detail=f"SSH error: {exc}")
# Mark as running in DB (truenas_job_id=NULL signals SSH-managed test)
# Store smartctl start output as proof the test was initiated
await db.execute(
"""INSERT INTO smart_tests (drive_id, test_type, state, percent, started_at, raw_output)
VALUES (?,?,?,?,?,?)
ON CONFLICT(drive_id, test_type) DO UPDATE SET
state='running', percent=0, truenas_job_id=NULL,
started_at=excluded.started_at, finished_at=NULL, error_text=NULL,
raw_output=excluded.raw_output""",
(drive_id, ttype_lower, "running", 0, now, output),
)
await db.commit()
from app import poller as _poller
_poller._notify_subscribers()
return {"devname": devname, "type": test_type, "message": output[:200]}
else:
# REST path — older TrueNAS CORE / SCALE versions
client = _burnin._client client = _burnin._client
if client is None: if client is None:
raise HTTPException(status_code=503, detail="TrueNAS client not ready") raise HTTPException(status_code=503, detail="TrueNAS client not ready")
try: try:
tn_job_id = await client.start_smart_test([devname], test_type) tn_job_id = await client.start_smart_test([devname], test_type)
except Exception as exc: except Exception as exc:
raise HTTPException(status_code=502, detail=f"TrueNAS error: {exc}") raise HTTPException(status_code=502, detail=f"TrueNAS error: {exc}")
return {"job_id": tn_job_id, "devname": devname, "type": test_type} return {"job_id": tn_job_id, "devname": devname, "type": test_type}
@ -403,7 +511,16 @@ async def smart_cancel(
if client is None: if client is None:
raise HTTPException(status_code=503, detail="TrueNAS client not ready") raise HTTPException(status_code=503, detail="TrueNAS client not ready")
# Find the running TrueNAS job for this drive/test-type from app import ssh_client
if ssh_client.is_configured():
# SSH path — abort via smartctl -X
try:
await ssh_client.abort_smart_test(devname)
except Exception as exc:
raise HTTPException(status_code=502, detail=f"SSH abort error: {exc}")
else:
# REST path — find TrueNAS job and abort it
try: try:
jobs = await client.get_smart_jobs() jobs = await client.get_smart_jobs()
tn_job_id = None tn_job_id = None
@ -479,13 +596,35 @@ async def burnin_start(req: StartBurninRequest):
drive_id, req.profile, req.operator, stage_order=req.stage_order drive_id, req.profile, req.operator, stage_order=req.stage_order
) )
results.append({"drive_id": drive_id, "job_id": job_id}) results.append({"drive_id": drive_id, "job_id": job_id})
except burnin.PoolMemberError as exc:
errors.append({
"drive_id": drive_id,
"error": str(exc),
"pool_name": exc.pool_name,
"pool_role": exc.pool_role,
"pool_locked": True,
})
except ValueError as exc: except ValueError as exc:
errors.append({"drive_id": drive_id, "error": str(exc)}) errors.append({"drive_id": drive_id, "error": str(exc)})
if errors and not results: if errors and not results:
raise HTTPException(status_code=409, detail=errors[0]["error"]) # Surface the first error's structured fields so the UI can render
# an unlock affordance instead of a generic toast.
raise HTTPException(status_code=409, detail=errors[0])
return {"queued": results, "errors": errors} return {"queued": results, "errors": errors}
@router.post("/api/v1/drives/{drive_id}/unlock")
async def unlock_pool_drive(drive_id: int, req: UnlockPoolDriveRequest):
try:
expiry = await burnin.grant_pool_unlock(
drive_id, req.confirm_token, req.operator, req.reason,
)
except ValueError as exc:
raise HTTPException(status_code=400, detail=str(exc))
return {"unlocked": True, "expires_at": expiry,
"ttl_seconds": burnin.UNLOCK_TTL_SECONDS}
@router.post("/api/v1/burnin/{job_id}/cancel") @router.post("/api/v1/burnin/{job_id}/cancel")
async def burnin_cancel(job_id: int, req: CancelBurninRequest): async def burnin_cancel(job_id: int, req: CancelBurninRequest):
ok = await burnin.cancel_job(job_id, req.operator) ok = await burnin.cancel_job(job_id, req.operator)
@ -562,7 +701,7 @@ async def history_list(
jobs = [dict(r) for r in rows] jobs = [dict(r) for r in rows]
ps = poller.get_state() ps = poller.get_state()
return templates.TemplateResponse("history.html", { return templates.TemplateResponse(request, "history.html", {
"request": request, "request": request,
"jobs": jobs, "jobs": jobs,
"active_state": state, "active_state": state,
@ -612,7 +751,7 @@ async def history_detail(
job["stages"] = [dict(r) for r in await cur.fetchall()] job["stages"] = [dict(r) for r in await cur.fetchall()]
ps = poller.get_state() ps = poller.get_state()
return templates.TemplateResponse("job_detail.html", { return templates.TemplateResponse(request, "job_detail.html", {
"request": request, "request": request,
"job": job, "job": job,
"poller": ps, "poller": ps,
@ -791,7 +930,7 @@ async def audit_log(
cur = await db.execute(_AUDIT_QUERY) cur = await db.execute(_AUDIT_QUERY)
rows = [dict(r) for r in await cur.fetchall()] rows = [dict(r) for r in await cur.fetchall()]
ps = poller.get_state() ps = poller.get_state()
return templates.TemplateResponse("audit.html", { return templates.TemplateResponse(request, "audit.html", {
"request": request, "request": request,
"events": rows, "events": rows,
"event_colors": _AUDIT_EVENT_COLORS, "event_colors": _AUDIT_EVENT_COLORS,
@ -887,7 +1026,7 @@ async def stats_page(
drives_total = (await cur.fetchone())[0] drives_total = (await cur.fetchone())[0]
ps = poller.get_state() ps = poller.get_state()
return templates.TemplateResponse("stats.html", { return templates.TemplateResponse(request, "stats.html", {
"request": request, "request": request,
"overall": overall, "overall": overall,
"by_model": by_model, "by_model": by_model,
@ -931,6 +1070,9 @@ async def settings_page(
"temp_warn_c": settings.temp_warn_c, "temp_warn_c": settings.temp_warn_c,
"temp_crit_c": settings.temp_crit_c, "temp_crit_c": settings.temp_crit_c,
"bad_block_threshold": settings.bad_block_threshold, "bad_block_threshold": settings.bad_block_threshold,
"surface_validate_block_size": settings.surface_validate_block_size,
"surface_validate_block_buffer": settings.surface_validate_block_buffer,
"surface_validate_passes": settings.surface_validate_passes,
# SSH credentials (take effect immediately — each SSH call reads live settings) # SSH credentials (take effect immediately — each SSH call reads live settings)
"ssh_host": settings.ssh_host, "ssh_host": settings.ssh_host,
"ssh_port": settings.ssh_port, "ssh_port": settings.ssh_port,
@ -948,7 +1090,7 @@ async def settings_page(
from app import ssh_client as _ssh from app import ssh_client as _ssh
ps = poller.get_state() ps = poller.get_state()
return templates.TemplateResponse("settings.html", { return templates.TemplateResponse(request, "settings.html", {
"request": request, "request": request,
"editable": editable, "editable": editable,
"smtp_enabled": bool(settings.smtp_host), "smtp_enabled": bool(settings.smtp_host),
@ -1069,7 +1211,7 @@ async def history_print(
""", (job_id,)) """, (job_id,))
job["stages"] = [dict(r) for r in await cur.fetchall()] job["stages"] = [dict(r) for r in await cur.fetchall()]
return templates.TemplateResponse("job_print.html", { return templates.TemplateResponse(request, "job_print.html", {
"request": request, "request": request,
"job": job, "job": job,
}) })


@ -38,6 +38,9 @@ _EDITABLE: dict[str, type] = {
"temp_warn_c": int, "temp_warn_c": int,
"temp_crit_c": int, "temp_crit_c": int,
"bad_block_threshold": int, "bad_block_threshold": int,
"surface_validate_block_size": int,
"surface_validate_block_buffer": int,
"surface_validate_passes": int,
# SSH credentials — take effect immediately (each connection reads live settings) # SSH credentials — take effect immediately (each connection reads live settings)
"ssh_host": str, "ssh_host": str,
"ssh_port": int, "ssh_port": int,
@ -96,6 +99,26 @@ def _apply(data: dict) -> None:
if key == "bad_block_threshold" and int(val) < 0: if key == "bad_block_threshold" and int(val) < 0:
log.warning("settings_store: bad_block_threshold must be >= 0 — ignoring") log.warning("settings_store: bad_block_threshold must be >= 0 — ignoring")
continue continue
if key == "surface_validate_block_size":
# badblocks accepts any positive int but in practice the
# useful range is 512..1048576 and it should be a power of 2.
v = int(val)
if v < 512 or v > 1048576 or (v & (v - 1)) != 0:
log.warning(
"settings_store: surface_validate_block_size must be "
"a power of 2 between 512 and 1048576 — ignoring %r", val
)
continue
if key == "surface_validate_block_buffer" and not (1 <= int(val) <= 4096):
log.warning(
"settings_store: surface_validate_block_buffer must be 1..4096 — ignoring"
)
continue
if key == "surface_validate_passes" and not (0 <= int(val) <= 16):
log.warning(
"settings_store: surface_validate_passes must be 0..16 — ignoring"
)
continue
if key == "ssh_port" and not (1 <= int(val) <= 65535): if key == "ssh_port" and not (1 <= int(val) <= 65535):
log.warning("settings_store: ssh_port out of range — ignoring") log.warning("settings_store: ssh_port out of range — ignoring")
continue continue
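
The block-size rule added above can be exercised in isolation. The following is a minimal sketch, not the project's code: `is_valid_block_size` is a hypothetical helper name, and only the bounds and the bit trick are taken from the diff.

```python
def is_valid_block_size(value: int) -> bool:
    """Return True when value is a power of two in [512, 1048576].

    Hypothetical helper mirroring the check in _apply(); the name is
    an assumption made for illustration.
    """
    # A power of two has exactly one bit set, so clearing the lowest
    # set bit with v & (v - 1) leaves zero.
    return 512 <= value <= 1048576 and (value & (value - 1)) == 0


if __name__ == "__main__":
    for candidate in (256, 512, 1000, 4096, 1048576, 2097152):
        print(candidate, is_valid_block_size(candidate))
```

Rejected values are logged and skipped in the diff rather than raised, so a bad value leaves the previous setting in place.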


@ -38,15 +38,26 @@ SMART_ATTRS: dict[int, tuple[str, bool]] = {
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
def is_configured() -> bool: def is_configured() -> bool:
"""Returns True when SSH credentials are present and usable.""" """Returns True when SSH host + at least one auth method is available."""
import os
from app.config import settings from app.config import settings
return bool(settings.ssh_host and (settings.ssh_password or settings.ssh_key)) if not settings.ssh_host:
return False
has_creds = bool(
settings.ssh_key
or settings.ssh_password
or os.path.exists(os.environ.get("SSH_KEY_FILE", _MOUNTED_KEY_PATH))
)
return has_creds
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
# Low-level connection # Low-level connection
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
_MOUNTED_KEY_PATH = "/run/secrets/ssh_key"
async def _connect(): async def _connect():
"""Open a single-use SSH connection. Caller must use `async with`.""" """Open a single-use SSH connection. Caller must use `async with`."""
import asyncssh import asyncssh
@ -59,9 +70,17 @@ async def _connect():
"known_hosts": None, # trust all hosts (same spirit as TRUENAS_VERIFY_TLS=false) "known_hosts": None, # trust all hosts (same spirit as TRUENAS_VERIFY_TLS=false)
} }
if settings.ssh_key: if settings.ssh_key:
# Key material provided via env var (base case)
kwargs["client_keys"] = [asyncssh.import_private_key(settings.ssh_key)] kwargs["client_keys"] = [asyncssh.import_private_key(settings.ssh_key)]
if settings.ssh_password: elif settings.ssh_password:
kwargs["password"] = settings.ssh_password kwargs["password"] = settings.ssh_password
else:
# Fall back to mounted key file (preferred for production — no key in env vars)
import os
key_path = os.environ.get("SSH_KEY_FILE", _MOUNTED_KEY_PATH)
if os.path.exists(key_path):
kwargs["client_keys"] = [key_path]
# If nothing is configured, asyncssh will attempt agent/default key lookup
return asyncssh.connect(**kwargs) return asyncssh.connect(**kwargs)
@ -228,6 +247,294 @@ async def run_badblocks(
} }
def _parse_zpool_list_output(stdout: str) -> dict:
"""Pure parser for `zpool list -vHP` stdout. Exposed for unit tests.
See get_pool_membership() for output semantics. This function never
raises; malformed lines are silently skipped.
"""
import re as _re
def _strip_partition(name: str) -> str:
m = _re.match(r"^(nvme\d+n\d+)", name)
if m:
return m.group(1)
m = _re.match(r"^(sd[a-z]+)", name)
if m:
return m.group(1)
return name
SECTION_MARKERS = {"cache", "log", "logs", "spare", "spares",
"special", "dedup"}
SECTION_NORMALIZE = {"logs": "log", "spares": "spare"}
out: dict = {}
current_pool: str | None = None
current_role: str = "data"
for raw in stdout.splitlines():
if not raw.strip():
continue
depth = 0
while depth < len(raw) and raw[depth] == "\t":
depth += 1
first = raw[depth:].split("\t", 1)[0].strip()
if depth == 0:
current_pool = first
current_role = "data"
continue
if depth == 1:
if first in SECTION_MARKERS:
current_role = SECTION_NORMALIZE.get(first, first)
continue
if first.startswith(("mirror", "raidz", "draid")):
continue
if first.startswith("/dev/") and current_pool:
dn = _strip_partition(first[len("/dev/"):])
out[dn] = {"pool": current_pool, "role": current_role}
continue
if first.startswith("/dev/") and current_pool:
dn = _strip_partition(first[len("/dev/"):])
out[dn] = {"pool": current_pool, "role": current_role}
return out
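
To make the depth and section-marker handling concrete, here is a condensed sketch of the same parsing idea run against a hand-written sample. The function names and the sample text are assumptions for illustration; the sample is not captured from a real host, and real `zpool list -vHP` rows carry more columns.

```python
import re

SECTION_MARKERS = {"cache", "log", "logs", "spare", "spares", "special", "dedup"}
SECTION_NORMALIZE = {"logs": "log", "spares": "spare"}


def strip_partition(name: str) -> str:
    """Reduce 'sda2' or 'nvme0n1p1' to the base devname."""
    m = re.match(r"^(nvme\d+n\d+|sd[a-z]+)", name)
    return m.group(1) if m else name


def parse_zpool_list(stdout: str) -> dict:
    """Condensed sketch of the tab-depth parser (hypothetical name)."""
    out: dict = {}
    pool = None
    role = "data"
    for raw in stdout.splitlines():
        if not raw.strip():
            continue
        depth = len(raw) - len(raw.lstrip("\t"))  # count leading tabs
        first = raw[depth:].split("\t", 1)[0].strip()
        if depth == 0:
            pool, role = first, "data"  # a new pool resets the role
        elif depth == 1 and first in SECTION_MARKERS:
            role = SECTION_NORMALIZE.get(first, first)
        elif first.startswith("/dev/") and pool:
            dev = strip_partition(first[len("/dev/"):])
            out[dev] = {"pool": pool, "role": role}
        # vdev container lines (mirror-0, raidz1-0, ...) fall through
    return out


# Hand-written sample in the flattened layout described in the commit message.
SAMPLE = (
    "tank\t10T\t-\n"
    "\tmirror-0\t10T\t-\n"
    "\t/dev/sda2\t-\t-\n"
    "\t/dev/sdb2\t-\t-\n"
    "\tcache\t-\t-\n"
    "\t/dev/nvme0n1p1\t-\t-\n"
)

if __name__ == "__main__":
    print(parse_zpool_list(SAMPLE))
```

With this sample, `sda` and `sdb` map to pool `tank` with role `data`, and `nvme0n1` maps to role `cache`.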
def _parse_lsblk_zfs_output(stdout: str) -> set:
"""Pure parser for `lsblk -no NAME,FSTYPE -l` stdout. Returns base
devnames carrying ZFS labels (whole-disk OR via partition). Exposed
for unit tests."""
import re as _re
out: set = set()
for line in stdout.splitlines():
parts = line.split()
if len(parts) < 2:
continue
name, fstype = parts[0], parts[1]
if fstype != "zfs_member":
continue
if name.startswith("nvme"):
m = _re.match(r"^(nvme\d+n\d+)", name)
if m:
out.add(m.group(1))
else:
m = _re.match(r"^(sd[a-z]+)", name)
if m:
out.add(m.group(1))
return out
async def get_pool_membership() -> dict | None:
"""Return {devname: {"pool": str, "role": str}} for every drive in any zpool.
Parses `zpool list -vHP` output. Tab-indent depth tells us structure:
depth 0: pool name line
depth 1: vdev type line (mirror-N, raidz*N, draid*) OR section
marker (cache/log/spare/special/dedup/logs) OR a single-disk
vdev that is itself a /dev/... entry
depth 2: device line within a vdev, e.g. '/dev/sdX', '/dev/nvmeXnY';
may have a partition suffix that we strip back to the
base devname so it matches what TrueNAS reports.
Roles: data | cache | log | spare | special | dedup
Returns:
- {} when the SSH call succeeded and there are genuinely no pools
(including rc == 0 with empty stdout), or when SSH is not configured
- None on any failure (SSH down, non-zero exit). Callers MUST treat
None differently from {}: an empty dict is "definitely no pool
members," None is "we couldn't tell." Treating None as "no pool
members" is a fail-open security regression.
"""
import re as _re
if not is_configured():
return {}
cmd = "zpool list -vHP 2>/dev/null"
try:
async with await _connect() as conn:
r = await conn.run(cmd, check=False)
if r.returncode != 0:
return None
except Exception:
return None
if not r.stdout:
# rc==0 with empty output = host has no pools. (`zpool list -H`
# returns no rows when zero pools are imported.) That's a real
# answer, not a failure.
return {}
return _parse_zpool_list_output(r.stdout)
async def get_smart_health_map(devnames: list[str]) -> dict | None:
"""Return {devname: 'PASSED'|'FAILED'|'UNKNOWN'} for every devname.
Runs `smartctl -H` for each disk in a single SSH session, which is much faster
than one connection per disk. Returns None on any SSH failure so the
poller can fall back to the previously-stored health value rather than
silently overwriting everything as 'UNKNOWN'.
`smartctl -H` is the cheap SMART self-assessment lookup (no full
attribute scan), taking milliseconds per drive. The output format is stable:
SMART overall-health self-assessment test result: PASSED
SMART overall-health self-assessment test result: FAILED!
For drives that don't support the command at all, smartctl exits
non-zero and we record UNKNOWN for that device specifically.
"""
if not is_configured() or not devnames:
return {} if devnames else None
# Build one shell pipeline that prefixes each result with "@@DEVNAME@@"
# so we can split the combined stdout deterministically.
parts = []
for d in devnames:
# Reject anything that doesn't look like a basic devname so we
# never inject shell metacharacters into the remote command.
if not d.replace("nvme", "").replace("n", "").replace("p", "").replace("sd", "").isalnum():
continue
parts.append(f"echo '@@{d}@@'; smartctl -H /dev/{d} 2>&1; echo '@@END@@'")
if not parts:
return {}
cmd = "; ".join(parts)
try:
async with await _connect() as conn:
r = await asyncio.wait_for(conn.run(cmd, check=False), timeout=30)
except Exception:
return None
if not r.stdout:
return None
return _parse_smart_health_batch(r.stdout)
def _parse_smart_health_batch(stdout: str) -> dict:
"""Pure parser for the batched smartctl -H output. Exposed for tests."""
result: dict[str, str] = {}
current: str | None = None
buf: list[str] = []
def _flush():
if current is None:
return
text = "\n".join(buf)
if "PASSED" in text:
result[current] = "PASSED"
elif "FAILED" in text or "FAILURE" in text:
result[current] = "FAILED"
else:
result[current] = "UNKNOWN"
for raw in stdout.splitlines():
line = raw.strip()
if line.startswith("@@") and line.endswith("@@"):
inner = line[2:-2]
if inner == "END":
_flush()
current = None
buf = []
else:
_flush()
current = inner
buf = []
else:
buf.append(line)
_flush()
return result
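
The marker-delimited batching can be illustrated with a small standalone sketch. `parse_health_batch` is a hypothetical name, and the third sample record is an invented error line used only to show the UNKNOWN path; the two result lines follow the format quoted in the docstring above.

```python
def parse_health_batch(stdout: str) -> dict[str, str]:
    """Split '@@dev@@ ... @@END@@' records and classify each one."""
    result: dict[str, str] = {}
    current = None          # devname of the record being collected
    lines: list[str] = []   # output lines seen for that record

    def classify(text: str) -> str:
        if "PASSED" in text:
            return "PASSED"
        if "FAILED" in text or "FAILURE" in text:
            return "FAILED"
        return "UNKNOWN"

    for raw in stdout.splitlines():
        line = raw.strip()
        if len(line) >= 4 and line.startswith("@@") and line.endswith("@@"):
            # Any marker closes the record in progress, if there is one.
            if current is not None:
                result[current] = classify("\n".join(lines))
            name = line[2:-2]
            current = None if name == "END" else name
            lines = []
        else:
            lines.append(line)
    if current is not None:  # tolerate a missing final @@END@@
        result[current] = classify("\n".join(lines))
    return result


SAMPLE = """@@sda@@
SMART overall-health self-assessment test result: PASSED
@@END@@
@@sdb@@
SMART overall-health self-assessment test result: FAILED!
@@END@@
@@sdc@@
device does not support SMART (hypothetical error line)
@@END@@
"""

if __name__ == "__main__":
    print(parse_health_batch(SAMPLE))
```

Echoing a marker before each command keeps the split deterministic even when one `smartctl` call prints an error instead of a result line.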
async def get_zfs_member_drives() -> set | None:
"""Return devnames of every drive whose partitions carry a ZFS label.
Combined with get_pool_membership(): a drive in this set but NOT in the
active-pool map carries ZFS data from a previously-imported pool that
was exported (or imported on a different system). We treat those as
locked too, since wiping them would silently destroy a pool.
Returns:
- set() when lsblk succeeded and no drives carry ZFS labels
- None on any failure. Same fail-closed semantics as
get_pool_membership(): callers must NOT treat None as
"no exported drives"; that would be a security regression.
"""
if not is_configured():
return set()
cmd = "lsblk -no NAME,FSTYPE -l 2>/dev/null"
try:
async with await _connect() as conn:
r = await conn.run(cmd, check=False)
if r.returncode != 0:
return None
except Exception:
return None
if not r.stdout:
# lsblk with rc==0 and no output is impossible on a normal Linux
# host; treat as failure rather than "no drives at all."
return None
return _parse_lsblk_zfs_output(r.stdout)
async def get_system_sensors() -> dict:
"""
Run `sensors -j` on TrueNAS and extract system-level temperatures.
Returns {"cpu_c": int|None, "pch_c": int|None}.
cpu_c = CPU package temp (coretemp chip)
pch_c = PCH/chipset temp (pch_* chip), a proxy for storage I/O lane thermals
Falls back gracefully if SSH is not configured or lm-sensors is unavailable.
"""
if not is_configured():
return {}
try:
async with await _connect() as conn:
result = await conn.run("sensors -j 2>/dev/null", check=False)
output = result.stdout.strip()
if not output:
return {}
return _parse_sensors_json(output)
except Exception as exc:
log.debug("get_system_sensors failed: %s", exc)
return {}
def _parse_sensors_json(output: str) -> dict:
import json as _json
try:
data = _json.loads(output)
except Exception:
return {}
cpu_c: int | None = None
pch_c: int | None = None
for chip_name, chip_data in data.items():
if not isinstance(chip_data, dict):
continue
# CPU package temp — coretemp chip, "Package id N" sensor
if chip_name.startswith("coretemp") and cpu_c is None:
for sensor_name, sensor_vals in chip_data.items():
if not isinstance(sensor_vals, dict):
continue
if "package" in sensor_name.lower():
for k, v in sensor_vals.items():
if k.endswith("_input") and isinstance(v, (int, float)):
cpu_c = int(round(v))
break
if cpu_c is not None:
break
# PCH / chipset temp — manages PCIe lanes including HBA / storage I/O
elif chip_name.startswith("pch_") and pch_c is None:
for sensor_name, sensor_vals in chip_data.items():
if not isinstance(sensor_vals, dict):
continue
for k, v in sensor_vals.items():
if k.endswith("_input") and isinstance(v, (int, float)):
pch_c = int(round(v))
break
if pch_c is not None:
break
return {"cpu_c": cpu_c, "pch_c": pch_c}
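
As an illustration of the chip and sensor selection, the sketch below applies the same rules to a hand-built `sensors -j` style payload. The chip names, readings, and the `parse_sensors` name are assumptions; real output varies by board and driver.

```python
import json


def first_input(sensor: dict):
    """Return the first '*_input' reading as a rounded int, else None."""
    for key, value in sensor.items():
        if key.endswith("_input") and isinstance(value, (int, float)):
            return int(round(value))
    return None


def parse_sensors(output: str) -> dict:
    """Condensed sketch of the CPU package / PCH extraction."""
    try:
        data = json.loads(output)
    except ValueError:
        return {}
    if not isinstance(data, dict):
        return {}
    temps = {"cpu_c": None, "pch_c": None}
    for chip, sensors in data.items():
        if not isinstance(sensors, dict):
            continue
        for name, values in sensors.items():
            if not isinstance(values, dict):
                continue  # skips string fields such as "Adapter"
            if (chip.startswith("coretemp") and temps["cpu_c"] is None
                    and "package" in name.lower()):
                temps["cpu_c"] = first_input(values)
            elif chip.startswith("pch_") and temps["pch_c"] is None:
                temps["pch_c"] = first_input(values)
    return temps


# Hand-built payload; chip names and values are illustrative only.
SAMPLE = json.dumps({
    "coretemp-isa-0000": {
        "Adapter": "ISA adapter",
        "Package id 0": {"temp1_input": 41.0, "temp1_max": 80.0},
        "Core 0": {"temp2_input": 39.0},
    },
    "pch_cannonlake-virtual-0": {
        "Adapter": "Virtual device",
        "temp1": {"temp1_input": 52.0},
    },
})

if __name__ == "__main__":
    print(parse_sensors(SAMPLE))
```

Returning an empty dict on unparseable input matches the graceful-fallback behaviour described in the docstring above, so the dashboard can simply hide the chips.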
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
# Parsers # Parsers
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
@ -275,7 +582,7 @@ def _parse_smartctl(output: str) -> dict:
def _parse_smart_progress(output: str) -> dict: def _parse_smart_progress(output: str) -> dict:
state = "unknown" state = "unknown"
percent_remaining = 0 percent_remaining = None # None = "in progress but no % line parsed yet"
lower = output.lower() lower = output.lower()


@ -281,7 +281,11 @@ tr:hover td {
.col-size { min-width: 70px; text-align: right; } .col-size { min-width: 70px; text-align: right; }
.col-temp { min-width: 75px; text-align: right; } .col-temp { min-width: 75px; text-align: right; }
.col-health { min-width: 85px; } .col-health { min-width: 85px; }
.col-smart { min-width: 150px; } .col-smart { min-width: 95px; }
/* Tighter horizontal padding on the SMART columns: they hold short
pills ("Passed"/"—") or a progress bar, so the default 14px gutter
wastes space on 13" laptops. */
th.col-smart, td.col-smart { padding-left: 6px; padding-right: 6px; }
.col-actions { min-width: 170px; } .col-actions { min-width: 170px; }
/* ----------------------------------------------------------------------- /* -----------------------------------------------------------------------
@ -1076,6 +1080,56 @@ a.stat-card:hover {
.stat-passed .stat-value { color: var(--green); } .stat-passed .stat-value { color: var(--green); }
.stat-idle .stat-value { color: var(--text-muted); } .stat-idle .stat-value { color: var(--text-muted); }
/* Vertical separator between drive-count cards and sensor chips */
.stats-bar-sep {
width: 1px;
height: 36px;
background: var(--border);
align-self: center;
flex-shrink: 0;
}
/* Compact sensor chip — CPU / PCH / Thermal */
.stat-sensor {
background: var(--bg-card);
border: 1px solid var(--border);
border-radius: 8px;
padding: 6px 12px;
text-align: center;
min-width: 52px;
display: flex;
flex-direction: column;
gap: 2px;
}
.stat-sensor-val {
font-size: 16px;
font-weight: 700;
font-variant-numeric: tabular-nums;
line-height: 1.1;
}
.stat-sensor-label {
font-size: 9px;
text-transform: uppercase;
letter-spacing: 0.08em;
color: var(--text-muted);
line-height: 1.2;
}
/* Thermal pressure states */
.stat-sensor-thermal-warn {
border-color: var(--yellow-bd);
background: var(--yellow-bg);
}
.stat-sensor-thermal-warn .stat-sensor-val { color: var(--yellow); }
.stat-sensor-thermal-crit {
border-color: var(--red-bd);
background: var(--red-bg);
}
.stat-sensor-thermal-crit .stat-sensor-val { color: var(--red); }
/* ----------------------------------------------------------------------- /* -----------------------------------------------------------------------
Batch action bar (inside filter-bar) Batch action bar (inside filter-bar)
----------------------------------------------------------------------- */ ----------------------------------------------------------------------- */
@ -2372,6 +2426,85 @@ tr.drawer-row-active {
color: var(--yellow); color: var(--yellow);
} }
/* -----------------------------------------------------------------------
Pool-membership lock indicators
----------------------------------------------------------------------- */
.pool-lock-icon {
display: inline-block;
margin-right: 4px;
font-size: 12px;
color: var(--yellow);
vertical-align: baseline;
}
.pool-lock-icon.pool-lock-boot {
color: var(--red, #e25555);
}
.pool-pill {
display: inline-block;
margin-top: 3px;
padding: 1px 7px;
font-size: 10.5px;
font-weight: 600;
letter-spacing: 0.3px;
text-transform: uppercase;
border-radius: 4px;
background: color-mix(in srgb, var(--yellow) 14%, transparent);
color: var(--yellow);
border: 1px solid color-mix(in srgb, var(--yellow) 35%, transparent);
}
.pool-pill.pool-pill-boot {
background: color-mix(in srgb, var(--red, #e25555) 16%, transparent);
color: var(--red, #e25555);
border-color: color-mix(in srgb, var(--red, #e25555) 45%, transparent);
}
.pool-pill.pool-pill-exported {
background: color-mix(in srgb, #e07a3f 16%, transparent);
color: #e07a3f;
border-color: color-mix(in srgb, #e07a3f 45%, transparent);
}
.pool-lock-icon.pool-lock-exported {
color: #e07a3f;
}
.btn-unlock {
background: transparent;
border: 1px solid color-mix(in srgb, var(--yellow) 50%, transparent);
color: var(--yellow);
border-radius: 5px;
padding: 3px 9px;
font-size: 12px;
cursor: pointer;
transition: background .15s, color .15s, border-color .15s;
}
.btn-unlock:hover {
background: color-mix(in srgb, var(--yellow) 14%, transparent);
}
.btn-unlock-boot {
border-color: color-mix(in srgb, var(--red, #e25555) 55%, transparent);
color: var(--red, #e25555);
}
.btn-unlock-boot:hover {
background: color-mix(in srgb, var(--red, #e25555) 14%, transparent);
}
.btn-unlock-exported {
border-color: color-mix(in srgb, #e07a3f 55%, transparent);
color: #e07a3f;
}
.btn-unlock-exported:hover {
background: color-mix(in srgb, #e07a3f 14%, transparent);
}
.unlock-countdown {
margin-left: 4px;
font-size: 11px;
color: var(--green, #39c179);
font-variant-numeric: tabular-nums;
}
.unlock-countdown-expired {
color: var(--yellow);
}
.modal.modal-danger {
border-top: 3px solid var(--red, #e25555);
}
/* ----------------------------------------------------------------------- /* -----------------------------------------------------------------------
Parallel burn-in inline warning Parallel burn-in inline warning
----------------------------------------------------------------------- */ ----------------------------------------------------------------------- */
@ -2409,41 +2542,3 @@ tr.drawer-row-active {
font-variant-numeric: tabular-nums; font-variant-numeric: tabular-nums;
} }
/* -----------------------------------------------------------------------
Live Terminal drawer panel (xterm.js)
----------------------------------------------------------------------- */
.drawer-panel-terminal {
padding: 0 !important;
overflow: hidden !important;
position: relative;
background: #0d1117;
}
/* Let xterm fill the full panel height */
.drawer-panel-terminal .xterm {
height: 100%;
}
.drawer-panel-terminal .xterm-viewport {
overflow-y: auto !important;
}
/* Reconnect bar — floats over the terminal when disconnected */
.term-reconnect-bar {
position: absolute;
bottom: 12px;
right: 12px;
z-index: 20;
display: flex;
align-items: center;
gap: 8px;
background: rgba(13,17,23,0.85);
border: 1px solid var(--border);
border-radius: 6px;
padding: 6px 10px;
font-size: 12px;
color: var(--text-muted);
}
.term-reconnect-bar .btn-secondary {
padding: 3px 10px;
font-size: 11px;
}


@ -68,6 +68,7 @@
applyFilter(activeFilter); applyFilter(activeFilter);
restoreCheckboxes(); restoreCheckboxes();
initElapsedTimers(); initElapsedTimers();
initUnlockCountdowns();
initLocationEdits(); initLocationEdits();
if (_drawerDriveId) { if (_drawerDriveId) {
_drawerHighlightRow(_drawerDriveId); _drawerHighlightRow(_drawerDriveId);
@ -135,14 +136,59 @@
if (nb) nb.style.display = 'none'; if (nb) nb.style.display = 'none';
} }
// Handle job-alert SSE events for browser notifications // Handle SSE events
document.addEventListener('htmx:sseMessage', function (e) { document.addEventListener('htmx:sseMessage', function (e) {
if (!e.detail || e.detail.type !== 'job-alert') return; if (!e.detail) return;
try { if (e.detail.type === 'job-alert') {
handleJobAlert(JSON.parse(e.detail.data)); try { handleJobAlert(JSON.parse(e.detail.data)); } catch (_) {}
} catch (_) {} } else if (e.detail.type === 'system-sensors') {
try { handleSystemSensors(JSON.parse(e.detail.data)); } catch (_) {}
}
}); });
function handleSystemSensors(data) {
var st = data.system_temps || {};
var tp = data.thermal_pressure || 'ok';
var warn = data.temp_warn_c || 46;
var crit = data.temp_crit_c || 55;
function tempClass(c) {
if (c == null) return '';
return c >= crit ? 'temp-hot' : c >= warn ? 'temp-warm' : 'temp-cool';
}
// CPU chip
var cpuChip = document.getElementById('sensor-cpu');
var cpuVal = document.getElementById('sensor-cpu-val');
if (cpuVal && st.cpu_c != null) {
if (cpuChip) cpuChip.hidden = false;
cpuVal.textContent = st.cpu_c + '°';
cpuVal.className = 'stat-sensor-val ' + tempClass(st.cpu_c);
}
// PCH chip
var pchChip = document.getElementById('sensor-pch');
var pchVal = document.getElementById('sensor-pch-val');
if (pchVal && st.pch_c != null) {
if (pchChip) pchChip.hidden = false;
pchVal.textContent = st.pch_c + '°';
pchVal.className = 'stat-sensor-val ' + tempClass(st.pch_c);
}
// Thermal pressure chip
var tChip = document.getElementById('sensor-thermal');
var tVal = document.getElementById('sensor-thermal-val');
if (tChip && tVal) {
if (tp === 'warn' || tp === 'crit') {
tChip.hidden = false;
tChip.className = 'stat-sensor stat-sensor-thermal stat-sensor-thermal-' + tp;
tVal.textContent = tp === 'warn' ? 'WARM' : 'HOT';
} else {
tChip.hidden = true;
}
}
}
function handleJobAlert(data) { function handleJobAlert(data) {
var isPass = data.state === 'passed'; var isPass = data.state === 'passed';
var icon = isPass ? '✓' : '✕'; var icon = isPass ? '✓' : '✕';
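
Note on the sensor hunk above: the colour class for each chip comes from two thresholds with inclusive lower bounds, and the thresholds fall back to 46 and 55 when the SSE payload omits them. The sketch below pulls that logic out of the closure so it can be checked on its own. `tempClassFor` and `sensorThresholds` are hypothetical names, not functions in app.js.

```javascript
// Hypothetical standalone version of the tempClass() closure inside
// handleSystemSensors(); thresholds are passed in instead of captured.
function tempClassFor(c, warn, crit) {
  if (c == null) return '';          // no reading: no colour class
  if (c >= crit) return 'temp-hot';  // crit is checked first, bound is inclusive
  if (c >= warn) return 'temp-warm';
  return 'temp-cool';
}

// Same fallback rule as the handler: a missing (or zero) threshold in the
// payload falls back to 46 / 55.
function sensorThresholds(data) {
  return { warn: data.temp_warn_c || 46, crit: data.temp_crit_c || 55 };
}

var t = sensorThresholds({});
console.log(tempClassFor(45, t.warn, t.crit));
console.log(tempClassFor(46, t.warn, t.crit));
console.log(tempClassFor(55, t.warn, t.crit));
```

Because the fallback uses `||`, a payload that sends a threshold of 0 is treated the same as a missing one.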
@@ -203,6 +249,41 @@
initElapsedTimers(); initElapsedTimers();
// Live countdown for pool-drive unlock TTL — runs once per second; ticker
// self-stops when no .unlock-countdown spans remain on the page.
var _unlockTickInterval = null;
function tickUnlockCountdowns() {
var spans = document.querySelectorAll('.unlock-countdown[data-expires]');
if (spans.length === 0) {
if (_unlockTickInterval) {
clearInterval(_unlockTickInterval);
_unlockTickInterval = null;
}
return;
}
var nowSec = Date.now() / 1000;
spans.forEach(function (el) {
var exp = parseFloat(el.dataset.expires);
if (!exp || isNaN(exp)) return;
var rem = Math.max(0, exp - nowSec);
if (rem <= 0) {
el.textContent = 'expired';
el.className = 'unlock-countdown unlock-countdown-expired';
return;
}
var m = Math.floor(rem / 60);
var s = Math.floor(rem % 60);
el.textContent = '\u{1F513} ' + m + ':' + (s < 10 ? '0' : '') + s;
});
}
function initUnlockCountdowns() {
if (_unlockTickInterval) return;
if (document.querySelectorAll('.unlock-countdown[data-expires]').length === 0) return;
_unlockTickInterval = setInterval(tickUnlockCountdowns, 1000);
tickUnlockCountdowns();
}
initUnlockCountdowns();
// ----------------------------------------------------------------------- // -----------------------------------------------------------------------
// Inline location / notes edit // Inline location / notes edit
// ----------------------------------------------------------------------- // -----------------------------------------------------------------------
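
Note on the countdown hunk above: `tickUnlockCountdowns()` compares `data-expires` against `Date.now() / 1000`, so the attribute is presumably epoch seconds. The formatting step can be sketched as a pure function, shown below. `formatUnlockRemaining` is a hypothetical name used only for illustration.

```javascript
// Hypothetical pure version of the per-span work in tickUnlockCountdowns().
// expiresRaw is the data-expires string; nowSec is the current epoch time
// in seconds. Returns null when the span should be left untouched.
function formatUnlockRemaining(expiresRaw, nowSec) {
  var exp = parseFloat(expiresRaw);
  if (!exp || isNaN(exp)) return null;
  var rem = Math.max(0, exp - nowSec);
  if (rem <= 0) return 'expired';
  var m = Math.floor(rem / 60);
  var s = Math.floor(rem % 60);
  return '\u{1F513} ' + m + ':' + (s < 10 ? '0' : '') + s;
}

console.log(formatUnlockRemaining('1000', 935));
console.log(formatUnlockRemaining('1000', 1000));
```

One caveat if the backend ever emitted an ISO timestamp instead of a number: `parseFloat` would read only the leading digits (the year), and the span would show "expired" at once.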
@@ -538,7 +619,16 @@
var data = await resp.json(); var data = await resp.json();
if (!resp.ok) { if (!resp.ok) {
showToast(data.detail || 'Failed to start burn-in', 'error'); // detail may be the structured pool-locked object {drive_id,
// pool_name, pool_role, pool_locked: true, error: "..."}.
// The user already opened the start modal, so the unlock TTL must
// have just expired between modal-open and submit. Auto-flip to
// the unlock modal for that drive.
if (_handlePoolLockedError(data.detail)) {
closeModal();
return;
}
showToast(_extractErrorMessage(data.detail) || 'Failed to start burn-in', 'error');
return; return;
} }
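
Note on the hunk above: the old `data.detail || 'Failed to start burn-in'` pattern breaks once `detail` can be an object, because a toast that stringifies it shows `[object Object]`. A minimal sketch of the difference follows. `extractErrorMessage` repeats the logic of `_extractErrorMessage` from this diff, and `toastText` is a hypothetical wrapper.

```javascript
// Same logic as _extractErrorMessage in the diff: accept a legacy string
// or a structured object that carries an `error` field.
function extractErrorMessage(detail) {
  if (!detail) return null;
  if (typeof detail === 'string') return detail;
  if (typeof detail === 'object' && detail.error) return detail.error;
  return null;
}

// Hypothetical wrapper: pick the text a toast would display.
function toastText(detail, fallback) {
  return extractErrorMessage(detail) || fallback;
}

var structured = { drive_id: 7, pool_locked: true, error: 'Drive is pool-locked' };
console.log(String(structured || 'Failed to start burn-in')); // what the old pattern yields
console.log(toastText(structured, 'Failed to start burn-in'));
```

Objects without an `error` field, arrays included, fall through to the fallback string rather than being stringified.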
@@ -549,6 +639,161 @@
} }
} }
// Helpers shared between single-drive and batch start error paths.
// Backend returns either a string (legacy errors) or, for pool-locked
// drives, an object: {drive_id, error, pool_name, pool_role, pool_locked}.
function _extractErrorMessage(detail) {
if (!detail) return null;
if (typeof detail === 'string') return detail;
if (typeof detail === 'object' && detail.error) return detail.error;
return null;
}
// Returns true if it handled a pool-locked error by opening the unlock
// modal for the offending drive. Caller should bail out.
function _handlePoolLockedError(detail) {
if (!detail || typeof detail !== 'object' || !detail.pool_locked) return false;
var driveId = detail.drive_id;
if (driveId == null) return false;
var btn = document.querySelector('.btn-unlock[data-drive-id="' + driveId + '"]');
if (btn) {
// openUnlockModal only opens the unlock modal; it does not close any
// other modal. The caller must close the start/batch modal explicitly,
// since openUnlockModal doesn't know which one is open.
openUnlockModal(btn);
return true;
}
// Unlock button not in the DOM (drive row may have refreshed).
// Surface a descriptive toast instead of [object Object].
showToast(
(detail.error || 'Drive is pool-locked') +
' Reload the page and click Unlock on the drive row.',
'error',
);
return true;
}
// -----------------------------------------------------------------------
// Pool-drive Unlock modal
// -----------------------------------------------------------------------
var unlockDriveId = null;
var unlockExpectedToken = null;
function openUnlockModal(btn) {
unlockDriveId = btn.dataset.driveId;
var poolName = btn.dataset.poolName || '';
var poolRole = btn.dataset.poolRole || 'data';
var isBoot = btn.dataset.isBootPool === '1';
var isExported = btn.dataset.isExported === '1';
if (isBoot) unlockExpectedToken = 'DESTROY BOOT POOL';
else if (isExported) unlockExpectedToken = 'DESTROY EXPORTED POOL';
else unlockExpectedToken = poolName;
document.getElementById('unlock-devname').textContent = btn.dataset.devname || '—';
document.getElementById('unlock-model').textContent = btn.dataset.model || '—';
document.getElementById('unlock-serial').textContent = btn.dataset.serial || '—';
document.getElementById('unlock-size').textContent = btn.dataset.size || '—';
var chip = document.getElementById('unlock-pool-chip');
if (isExported) {
chip.textContent = 'exported ZFS';
chip.className = 'chip chip-aborted';
} else {
chip.textContent = poolName + ' · ' + poolRole;
chip.className = 'chip ' + (isBoot ? 'chip-failed' : 'chip-aborted');
}
var titleEl = document.getElementById('unlock-modal-title');
var warnTitle = document.getElementById('unlock-warning-title');
var warnBody = document.getElementById('unlock-warning-body');
if (isBoot) {
titleEl.textContent = 'Unlock BOOT POOL drive';
warnTitle.textContent = 'This is a TrueNAS BOOT drive.';
warnBody.textContent =
'Running burn-in on this drive will destroy the operating system on it. ' +
'If this drive is half of a mirrored boot pool, the system will continue running on the other mirror, ' +
'but you must already have a replacement plan. Proceeding without one bricks the host.';
} else if (isExported) {
titleEl.textContent = 'Unlock drive with EXPORTED ZFS data';
warnTitle.textContent = 'This drive carries ZFS data from a previously-imported pool.';
warnBody.textContent =
"TrueNAS isn't using this pool right now, but the drive still holds the labels and data. " +
'Burning it in will silently destroy whatever pool that data belongs to — including ' +
'pools that another system may be relying on. Confirm you have already evacuated or ' +
'reassigned the pool before continuing.';
} else {
titleEl.textContent = 'Unlock pool drive';
warnTitle.textContent = "This drive belongs to zpool '" + poolName + "'.";
warnBody.textContent =
'Running a destructive burn-in stage will overwrite all data on this drive ' +
'and almost certainly destroy the pool. Only proceed if you have already ' +
'removed this drive from the pool, or if you are intentionally decommissioning the pool.';
}
document.getElementById('unlock-confirm-token').textContent = unlockExpectedToken;
document.getElementById('unlock-confirm-hint').textContent = 'Expected: ' + unlockExpectedToken;
document.getElementById('unlock-confirm-input').value = '';
document.getElementById('unlock-reason-input').value = '';
var savedOp = localStorage.getItem('burnin_operator') || '';
document.getElementById('unlock-operator-input').value = savedOp;
validateUnlockModal();
document.getElementById('unlock-modal').removeAttribute('hidden');
setTimeout(function () {
document.getElementById('unlock-operator-input').focus();
}, 50);
}
function closeUnlockModal() {
document.getElementById('unlock-modal').setAttribute('hidden', '');
unlockDriveId = null;
unlockExpectedToken = null;
}
function validateUnlockModal() {
var op = (document.getElementById('unlock-operator-input').value || '').trim();
var rsn = (document.getElementById('unlock-reason-input').value || '').trim();
var tok = (document.getElementById('unlock-confirm-input').value || '').trim();
var ok = op.length > 0 && rsn.length >= 5 && tok === unlockExpectedToken;
document.getElementById('unlock-modal-submit-btn').disabled = !ok;
}
async function submitUnlock() {
var op = (document.getElementById('unlock-operator-input').value || '').trim();
var rsn = (document.getElementById('unlock-reason-input').value || '').trim();
var tok = (document.getElementById('unlock-confirm-input').value || '').trim();
localStorage.setItem('burnin_operator', op);
var btn = document.getElementById('unlock-modal-submit-btn');
btn.disabled = true;
try {
var resp = await fetch('/api/v1/drives/' + unlockDriveId + '/unlock', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
confirm_token: tok,
operator: op,
reason: rsn,
}),
});
var data = await resp.json();
if (!resp.ok) {
showToast(_extractErrorMessage(data.detail) || 'Unlock failed', 'error');
btn.disabled = false;
return;
}
closeUnlockModal();
showToast('Unlocked for 10 minutes — start burn-in now to use it.', 'success');
// Force a drive list refresh so the row flips from Unlock → Burn-In
if (typeof refreshDrives === 'function') refreshDrives();
} catch (err) {
showToast('Network error', 'error');
btn.disabled = false;
}
}
// ----------------------------------------------------------------------- // -----------------------------------------------------------------------
// Batch Burn-In // Batch Burn-In
// ----------------------------------------------------------------------- // -----------------------------------------------------------------------
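
Note on the unlock modal above: the confirmation gate has two parts. The first is which token the operator must type, the second is when the submit button is enabled. The sketch below restates both as pure functions. `expectedUnlockToken` and `isUnlockFormValid` are hypothetical names, and the plain-object arguments stand in for the button dataset and the input values.

```javascript
// Hypothetical restatement of the token choice in openUnlockModal().
function expectedUnlockToken(d) {
  if (d.isBoot) return 'DESTROY BOOT POOL';
  if (d.isExported) return 'DESTROY EXPORTED POOL';
  return d.poolName;                 // ordinary pool: type the pool name
}

// Hypothetical restatement of validateUnlockModal(): operator is required,
// reason needs at least 5 characters after trimming, and the token must
// match exactly (case-sensitive).
function isUnlockFormValid(form, expectedToken) {
  var op = (form.operator || '').trim();
  var rsn = (form.reason || '').trim();
  var tok = (form.token || '').trim();
  return op.length > 0 && rsn.length >= 5 && tok === expectedToken;
}

var token = expectedUnlockToken({ poolName: 'tank', isBoot: false, isExported: false });
console.log(isUnlockFormValid({ operator: 'sam', reason: 'replacing drive', token: 'tank' }, token));
console.log(isUnlockFormValid({ operator: 'sam', reason: 'replacing drive', token: 'Tank' }, token));
```

This check only controls the button state in the browser. The request still sends `confirm_token`, `operator` and `reason` to the server, which presumably validates them again.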
@@ -729,7 +974,11 @@
}); });
var data = await resp.json(); var data = await resp.json();
if (!resp.ok) { if (!resp.ok) {
showToast(data.detail || 'Failed to queue batch', 'error'); if (_handlePoolLockedError(data.detail)) {
closeBatchModal();
return;
}
showToast(_extractErrorMessage(data.detail) || 'Failed to queue batch', 'error');
if (btn) btn.disabled = false; if (btn) btn.disabled = false;
return; return;
} }
@@ -738,10 +987,17 @@
checkedDriveIds.clear(); checkedDriveIds.clear();
updateBatchBar(); updateBatchBar();
var queued = (data.queued || []).length; var queued = (data.queued || []).length;
var errors = (data.errors || []).length; var allErrors = data.errors || [];
var msg = queued + ' burn-in(s) queued'; var poolLocked = allErrors.filter(function (e) { return e && e.pool_locked; });
if (errors) msg += ', ' + errors + ' skipped (already active)'; var alreadyActive = allErrors.length - poolLocked.length;
showToast(msg, errors && !queued ? 'error' : 'success');
var parts = [queued + ' burn-in(s) queued'];
if (alreadyActive) parts.push(alreadyActive + ' skipped (already active)');
if (poolLocked.length) {
parts.push(poolLocked.length + ' pool-locked (use Unlock on each row)');
}
var tone = (queued === 0 && allErrors.length) ? 'error' : 'success';
showToast(parts.join(', '), tone);
} catch (err) { } catch (err) {
showToast('Network error', 'error'); showToast('Network error', 'error');
if (btn) btn.disabled = false; if (btn) btn.disabled = false;
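
Note on the batch hunk above: the new toast splits `data.errors` into pool-locked entries and everything else, and only turns red when nothing was queued. The sketch below restates that as a pure function. `summarizeBatch` is a hypothetical name, and the response shape is taken from the diff.

```javascript
// Hypothetical pure version of the batch-result summary.
// data is the parsed JSON response: { queued: [...], errors: [...] }.
function summarizeBatch(data) {
  var queued = (data.queued || []).length;
  var allErrors = data.errors || [];
  var poolLocked = allErrors.filter(function (e) { return e && e.pool_locked; });
  // Every error that is not pool-locked is reported as "already active".
  var alreadyActive = allErrors.length - poolLocked.length;

  var parts = [queued + ' burn-in(s) queued'];
  if (alreadyActive) parts.push(alreadyActive + ' skipped (already active)');
  if (poolLocked.length) {
    parts.push(poolLocked.length + ' pool-locked (use Unlock on each row)');
  }
  var tone = (queued === 0 && allErrors.length) ? 'error' : 'success';
  return { message: parts.join(', '), tone: tone };
}

console.log(summarizeBatch({
  queued: [1, 2],
  errors: [{ pool_locked: true }, { error: 'already running' }],
}));
```

The "already active" count is a remainder, not a positive match, so any future error type would be labelled that way unless a further filter is added.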
@@ -792,6 +1048,10 @@
var cancelSmartBtn = e.target.closest('.btn-cancel-smart'); var cancelSmartBtn = e.target.closest('.btn-cancel-smart');
if (cancelSmartBtn && !cancelSmartBtn.disabled) { cancelSmartTest(cancelSmartBtn); return; } if (cancelSmartBtn && !cancelSmartBtn.disabled) { cancelSmartTest(cancelSmartBtn); return; }
// Pool-drive unlock button (single drive)
var unlockBtn = e.target.closest('.btn-unlock');
if (unlockBtn && !unlockBtn.disabled) { openUnlockModal(unlockBtn); return; }
// Burn-in start button (single drive) // Burn-in start button (single drive)
var startBtn = e.target.closest('.btn-start'); var startBtn = e.target.closest('.btn-start');
if (startBtn && !startBtn.disabled) { openModal(startBtn); return; } if (startBtn && !startBtn.disabled) { openModal(startBtn); return; }
@@ -820,6 +1080,14 @@
return; return;
} }
// Unlock modal
if (e.target.closest('#unlock-modal-close-btn') || e.target.closest('#unlock-modal-cancel-btn')) {
closeUnlockModal();
return;
}
if (e.target.id === 'unlock-modal') { closeUnlockModal(); return; }
if (e.target.id === 'unlock-modal-submit-btn') { submitUnlock(); return; }
// Batch modal close // Batch modal close
if (e.target.closest('#batch-modal-close-btn') || e.target.closest('#batch-modal-cancel-btn')) { if (e.target.closest('#batch-modal-close-btn') || e.target.closest('#batch-modal-cancel-btn')) {
closeBatchModal(); closeBatchModal();
@@ -837,11 +1105,15 @@
document.addEventListener('input', function (e) { document.addEventListener('input', function (e) {
var id = e.target.id; var id = e.target.id;
if (id === 'unlock-operator-input' || id === 'unlock-reason-input' ||
id === 'unlock-confirm-input') validateUnlockModal();
if (id === 'operator-input' || id === 'confirm-serial') validateModal(); if (id === 'operator-input' || id === 'confirm-serial') validateModal();
}); });
document.addEventListener('keydown', function (e) { document.addEventListener('keydown', function (e) {
if (e.key === 'Escape') { if (e.key === 'Escape') {
var uModal = document.getElementById('unlock-modal');
if (uModal && !uModal.hidden) { closeUnlockModal(); return; }
var modal = document.getElementById('start-modal'); var modal = document.getElementById('start-modal');
if (modal && !modal.hidden) { closeModal(); return; } if (modal && !modal.hidden) { closeModal(); return; }
var bModal = document.getElementById('batch-modal'); var bModal = document.getElementById('batch-modal');
@@ -1117,14 +1389,6 @@
document.querySelectorAll('.drawer-panel').forEach(function (p) { document.querySelectorAll('.drawer-panel').forEach(function (p) {
p.classList.toggle('active', p.id === 'drawer-panel-' + _drawerTab); p.classList.toggle('active', p.id === 'drawer-panel-' + _drawerTab);
}); });
// Terminal tab: init/fit on activation; hide autoscroll (N/A for terminal)
var asl = document.querySelector('.autoscroll-label');
if (_drawerTab === 'terminal') {
if (asl) asl.style.visibility = 'hidden';
openTerminalTab();
} else {
if (asl) asl.style.visibility = '';
}
}); });
// Close button // Close button
@@ -1149,155 +1413,4 @@
}).catch(function () { showToast('Network error', 'error'); }); }).catch(function () { showToast('Network error', 'error'); });
}); });
// -----------------------------------------------------------------------
// Live Terminal (xterm.js + SSH WebSocket)
// -----------------------------------------------------------------------
var _xtermReady = false; // xterm.js + FitAddon libraries loaded
var _terminal = null; // xterm.js Terminal instance
var _termFit = null; // FitAddon instance
var _termWs = null; // active WebSocket (null = disconnected)
function _loadXtermLibs(cb) {
var link = document.createElement('link');
link.rel = 'stylesheet';
link.href = 'https://cdn.jsdelivr.net/npm/xterm@5.3.0/css/xterm.css';
document.head.appendChild(link);
var s1 = document.createElement('script');
s1.src = 'https://cdn.jsdelivr.net/npm/xterm@5.3.0/lib/xterm.js';
s1.onload = function () {
var s2 = document.createElement('script');
s2.src = 'https://cdn.jsdelivr.net/npm/xterm-addon-fit@0.8.0/lib/xterm-addon-fit.js';
s2.onload = cb;
document.head.appendChild(s2);
};
document.head.appendChild(s1);
}
function openTerminalTab() {
var panel = document.getElementById('drawer-panel-terminal');
if (!panel) return;
if (!_xtermReady) {
panel.innerHTML = '<div class="drawer-loading">Loading terminal\u2026</div>';
_loadXtermLibs(function () {
_xtermReady = true;
_termInit(panel);
});
return;
}
if (!_terminal) {
_termInit(panel);
return;
}
// Already initialised — refit to current panel dimensions
setTimeout(function () {
if (_termFit) try { _termFit.fit(); } catch (_) {}
}, 30);
}
function _termInit(panel) {
panel.innerHTML = '';
var term = new Terminal({
cursorBlink: true,
fontSize: 13,
fontFamily: '"SF Mono","Fira Code",Consolas,"DejaVu Sans Mono",monospace',
theme: {
background: '#0d1117',
foreground: '#e6edf3',
cursor: '#58a6ff',
cursorAccent: '#0d1117',
selectionBackground: 'rgba(88,166,255,0.25)',
black: '#484f58', red: '#ff7b72', green: '#3fb950', yellow: '#d29922',
blue: '#58a6ff', magenta: '#bc8cff', cyan: '#39c5cf', white: '#b1bac4',
brightBlack: '#6e7681', brightRed: '#ffa198', brightGreen: '#56d364',
brightYellow: '#e3b341', brightBlue: '#79c0ff', brightMagenta: '#d2a8ff',
brightCyan: '#56d4dd', brightWhite: '#f0f6fc',
},
scrollback: 2000,
allowProposedApi: true,
});
var fit = new FitAddon.FitAddon();
term.loadAddon(fit);
term.open(panel);
_terminal = term;
_termFit = fit;
// Initial fit after the panel is visible
setTimeout(function () {
if (_termFit) try { _termFit.fit(); } catch (_) {}
}, 30);
// Forward all keystrokes → SSH (onData registered once here)
term.onData(function (data) {
if (_termWs && _termWs.readyState === 1) {
_termWs.send(new TextEncoder().encode(data));
}
});
// Refit + notify server on resize
new ResizeObserver(function () {
if (!_termFit) return;
try { _termFit.fit(); } catch (_) {}
if (_termWs && _termWs.readyState === 1 && _terminal) {
_termWs.send(JSON.stringify({ type: 'resize', cols: _terminal.cols, rows: _terminal.rows }));
}
}).observe(panel);
_termConnect();
}
function _termConnect() {
if (_termWs && _termWs.readyState <= 1) return; // already open or connecting
var proto = location.protocol === 'https:' ? 'wss:' : 'ws:';
var ws = new WebSocket(proto + '//' + location.host + '/ws/terminal');
ws.binaryType = 'arraybuffer';
_termWs = ws;
ws.onopen = function () {
_termHideReconnect();
if (_terminal && ws.readyState === 1) {
ws.send(JSON.stringify({ type: 'resize', cols: _terminal.cols, rows: _terminal.rows }));
}
};
ws.onmessage = function (e) {
if (!_terminal) return;
_terminal.write(e.data instanceof ArrayBuffer ? new Uint8Array(e.data) : e.data);
};
ws.onclose = function () {
if (_terminal) _terminal.write('\r\n\x1b[33m\u2500\u2500 disconnected \u2500\u2500\x1b[0m\r\n');
_termShowReconnect();
};
ws.onerror = function () { /* onclose fires too */ };
}
function _termShowReconnect() {
var panel = document.getElementById('drawer-panel-terminal');
if (!panel || panel.querySelector('.term-reconnect-bar')) return;
var bar = document.createElement('div');
bar.className = 'term-reconnect-bar';
bar.innerHTML = '<span>Connection closed</span>'
+ '<button class="btn-secondary">\u21ba Reconnect</button>';
bar.querySelector('button').onclick = function () {
bar.remove();
_termConnect();
};
panel.appendChild(bar);
}
function _termHideReconnect() {
var bar = document.querySelector('.term-reconnect-bar');
if (bar) bar.remove();
}
}()); }());


@@ -80,11 +80,14 @@
{%- set bi_active = drive.burnin and drive.burnin.state in ('queued', 'running') %} {%- set bi_active = drive.burnin and drive.burnin.state in ('queued', 'running') %}
{%- set short_busy = drive.smart_short and drive.smart_short.state == 'running' %} {%- set short_busy = drive.smart_short and drive.smart_short.state == 'running' %}
{%- set long_busy = drive.smart_long and drive.smart_long.state == 'running' %} {%- set long_busy = drive.smart_long and drive.smart_long.state == 'running' %}
{%- set selectable = not bi_active and not short_busy and not long_busy %} {%- set pool_locked = drive.pool_name and not drive.pool_unlocked_until %}
{%- set is_boot_pool = drive.pool_name == 'boot-pool' %}
{%- set is_exported = drive.pool_role == 'exported' %}
{%- set selectable = not bi_active and not short_busy and not long_busy and not pool_locked %}
{%- set bi_done = drive.burnin and drive.burnin.state in ('passed', 'failed', 'cancelled', 'unknown') %} {%- set bi_done = drive.burnin and drive.burnin.state in ('passed', 'failed', 'cancelled', 'unknown') %}
{%- set smart_done = (drive.smart_short and drive.smart_short.state in ('passed','failed','aborted')) {%- set smart_done = (drive.smart_short and drive.smart_short.state in ('passed','failed','aborted'))
or (drive.smart_long and drive.smart_long.state in ('passed','failed','aborted')) %} or (drive.smart_long and drive.smart_long.state in ('passed','failed','aborted')) %}
{%- set can_reset = (bi_done or smart_done) and not bi_active and not short_busy and not long_busy %} {%- set can_reset = (bi_done or smart_done) and not bi_active and not short_busy and not long_busy and not pool_locked %}
<tr data-status="{{ drive.status }}" id="drive-{{ drive.id }}"> <tr data-status="{{ drive.status }}" id="drive-{{ drive.id }}">
<td class="col-check"> <td class="col-check">
{%- if selectable %} {%- if selectable %}
@@ -92,8 +95,18 @@
{%- endif %} {%- endif %}
</td> </td>
<td class="col-drive"> <td class="col-drive">
<span class="drive-name">{{ drive.devname }}</span> <span class="drive-name">
{%- if drive.pool_name -%}
<span class="pool-lock-icon{% if is_boot_pool %} pool-lock-boot{% elif is_exported %} pool-lock-exported{% endif %}"
title="{% if is_boot_pool %}In BOOT POOL '{{ drive.pool_name }}'{% elif is_exported %}Carries ZFS data from a previously-imported pool{% else %}In pool '{{ drive.pool_name }}' ({{ drive.pool_role or 'data' }}){% endif %}">&#x1F512;</span>
{%- endif -%}
{{ drive.devname }}
</span>
<span class="drive-model">{{ drive.model or "Unknown" }}</span> <span class="drive-model">{{ drive.model or "Unknown" }}</span>
{%- if drive.pool_name %}
<span class="pool-pill{% if is_boot_pool %} pool-pill-boot{% elif is_exported %} pool-pill-exported{% endif %}"
title="ZFS pool membership">{% if is_exported %}exported ZFS{% else %}{{ drive.pool_name }} &middot; {{ drive.pool_role or 'data' }}{% endif %}</span>
{%- endif %}
{%- if drive.location %} {%- if drive.location %}
<span class="drive-location" <span class="drive-location"
data-drive-id="{{ drive.id }}" data-drive-id="{{ drive.id }}"
@@ -154,6 +167,20 @@
{% if short_busy %}disabled{% endif %} {% if short_busy %}disabled{% endif %}
title="Start Long SMART test (~several hours)">Long</button> title="Start Long SMART test (~several hours)">Long</button>
{%- endif %} {%- endif %}
{%- if pool_locked %}
<!-- Drive is in a zpool — replace Burn-In with Unlock affordance -->
<button class="btn-action btn-unlock{% if is_boot_pool %} btn-unlock-boot{% elif is_exported %} btn-unlock-exported{% endif %}"
data-drive-id="{{ drive.id }}"
data-devname="{{ drive.devname }}"
data-serial="{{ drive.serial or '' }}"
data-model="{{ drive.model or 'Unknown' }}"
data-size="{{ drive.size_bytes | format_bytes }}"
data-pool-name="{{ drive.pool_name }}"
data-pool-role="{{ drive.pool_role or 'data' }}"
data-is-boot-pool="{{ '1' if is_boot_pool else '0' }}"
data-is-exported="{{ '1' if is_exported else '0' }}"
title="{% if is_boot_pool %}Drive is in BOOT POOL '{{ drive.pool_name }}' — click to unlock{% elif is_exported %}Drive carries ZFS data from a previously-imported pool — click to unlock{% else %}Drive is in pool '{{ drive.pool_name }}' — click to unlock{% endif %}">&#x1F512; Unlock</button>
{%- else %}
<!-- Burn-In --> <!-- Burn-In -->
<button class="btn-action btn-start{% if short_busy or long_busy %} btn-disabled{% endif %}" <button class="btn-action btn-start{% if short_busy or long_busy %} btn-disabled{% endif %}"
data-drive-id="{{ drive.id }}" data-drive-id="{{ drive.id }}"
@@ -162,8 +189,10 @@
data-model="{{ drive.model or 'Unknown' }}" data-model="{{ drive.model or 'Unknown' }}"
data-size="{{ drive.size_bytes | format_bytes }}" data-size="{{ drive.size_bytes | format_bytes }}"
data-health="{{ drive.smart_health }}" data-health="{{ drive.smart_health }}"
data-pool-name="{{ drive.pool_name or '' }}"
data-pool-unlocked-until="{{ drive.pool_unlocked_until or '' }}"
{% if short_busy or long_busy %}disabled{% endif %} {% if short_busy or long_busy %}disabled{% endif %}
title="Start Burn-In">Burn-In</button> title="Start Burn-In{% if drive.pool_name %} (UNLOCKED — pool drive){% endif %}">Burn-In{% if drive.pool_name %} <span class="unlock-countdown" data-expires="{{ drive.pool_unlocked_until }}">&#x1F513;</span>{% endif %}</button>
<!-- Reset — clears SMART state so drive can be re-tested from scratch --> <!-- Reset — clears SMART state so drive can be re-tested from scratch -->
{%- if can_reset %} {%- if can_reset %}
<button class="btn-action btn-reset" <button class="btn-action btn-reset"
@@ -171,6 +200,7 @@
title="Reset SMART state — clears test results so drive shows as fresh">Reset</button> title="Reset SMART state — clears test results so drive shows as fresh">Reset</button>
{%- endif %} {%- endif %}
{%- endif %} {%- endif %}
{%- endif %}
</div> </div>
</td> </td>
</tr> </tr>
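
Note on the drive-row template above: the row state depends on a handful of `set` expressions, and `pool_locked` is a pure truthiness check on `pool_name` and `pool_unlocked_until`. The sketch below mirrors those expressions in JavaScript for illustration only. `rowFlags` is a hypothetical name, and the sketch assumes the backend clears `pool_unlocked_until` once the unlock expires, since the template never compares it against the current time.

```javascript
// Hypothetical JavaScript mirror of the Jinja `set` expressions in the
// drive row template. `drive` uses the same field names as the template.
function rowFlags(drive) {
  var biActive = !!drive.burnin &&
    ['queued', 'running'].indexOf(drive.burnin.state) !== -1;
  var shortBusy = !!drive.smart_short && drive.smart_short.state === 'running';
  var longBusy = !!drive.smart_long && drive.smart_long.state === 'running';
  // Locked means: in a pool and no active unlock timestamp.
  var poolLocked = !!drive.pool_name && !drive.pool_unlocked_until;
  return {
    poolLocked: poolLocked,
    isBootPool: drive.pool_name === 'boot-pool',
    isExported: drive.pool_role === 'exported',
    selectable: !biActive && !shortBusy && !longBusy && !poolLocked,
  };
}

console.log(rowFlags({ pool_name: 'tank', pool_role: 'data', pool_unlocked_until: null }));
console.log(rowFlags({ pool_name: 'tank', pool_role: 'data', pool_unlocked_until: 1700000600 }));
```

A locked drive is excluded from batch selection and from Reset, and its Burn-In button is replaced by Unlock. An unlocked pool drive keeps its lock icon and pool pill but gets the normal Burn-In button with a countdown.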


@@ -0,0 +1,69 @@
<div id="unlock-modal" class="modal-overlay" hidden aria-modal="true" role="dialog">
<div class="modal modal-danger">
<div class="modal-header">
<h2 class="modal-title" id="unlock-modal-title">Unlock pool drive</h2>
<button class="modal-close" id="unlock-modal-close-btn" aria-label="Close">&#x2715;</button>
</div>
<div class="modal-body">
<div class="modal-drive-info">
<div class="modal-drive-row">
<span class="modal-devname" id="unlock-devname">&mdash;</span>
<span class="chip" id="unlock-pool-chip">&mdash;</span>
</div>
<div class="modal-drive-sub">
<span id="unlock-model">&mdash;</span>
&middot;
<span id="unlock-size">&mdash;</span>
&middot;
<span class="mono" id="unlock-serial">&mdash;</span>
</div>
</div>
<div id="unlock-warning" class="confirm-warning">
<strong id="unlock-warning-title">This drive belongs to a zpool.</strong>
<p id="unlock-warning-body">
Running a destructive burn-in stage will overwrite all data on this drive
and almost certainly destroy the pool. Only proceed if you have already
removed this drive from the pool, or if you are intentionally
decommissioning the pool.
</p>
</div>
<div class="form-group">
<label class="form-label" for="unlock-operator-input">Operator</label>
<input class="form-input" type="text" id="unlock-operator-input"
placeholder="Your name" autocomplete="name" maxlength="64">
</div>
<div class="form-group">
<label class="form-label" for="unlock-reason-input">
Reason (recorded to audit log, minimum 5 characters)
</label>
<input class="form-input" type="text" id="unlock-reason-input"
placeholder="e.g. replacing failed drive in tank/raidz2-0"
autocomplete="off" maxlength="200">
</div>
<div class="form-group">
<label class="form-label" for="unlock-confirm-input" id="unlock-confirm-label">
Type <code id="unlock-confirm-token">&mdash;</code> to confirm
</label>
<input class="form-input form-input-confirm" type="text" id="unlock-confirm-input"
placeholder="" autocomplete="off" spellcheck="false">
<div class="confirm-hint" id="unlock-confirm-hint"></div>
</div>
<div class="stage-always-note">
Unlock lasts 10 minutes. After that, this drive locks again automatically.
</div>
</div>
<div class="modal-footer">
<button class="btn-secondary" id="unlock-modal-cancel-btn">Cancel</button>
<button class="btn-danger" id="unlock-modal-submit-btn" disabled>Unlock</button>
</div>
</div>
</div>


@@ -5,8 +5,9 @@
{% block content %} {% block content %}
{% include "components/modal_start.html" %} {% include "components/modal_start.html" %}
{% include "components/modal_batch.html" %} {% include "components/modal_batch.html" %}
{% include "components/modal_unlock.html" %}
<!-- Stats bar — counts are updated live by app.js updateCounts() --> <!-- Stats bar — drive counts updated live by app.js updateCounts(); sensor chips updated by SSE system-sensors event -->
<div class="stats-bar"> <div class="stats-bar">
<div class="stat-card" data-stat-filter="all"> <div class="stat-card" data-stat-filter="all">
<span class="stat-value" id="stat-all">{{ drives | length }}</span> <span class="stat-value" id="stat-all">{{ drives | length }}</span>
@@ -28,6 +29,33 @@
<span class="stat-value" id="stat-idle">0</span> <span class="stat-value" id="stat-idle">0</span>
<span class="stat-label">Idle</span> <span class="stat-label">Idle</span>
</div> </div>
{%- set st = poller.system_temps if (poller and poller.system_temps) else {} %}
{%- if st.get('cpu_c') is not none or st.get('pch_c') is not none %}
<div class="stats-bar-sep"></div>
{%- if st.get('cpu_c') is not none %}
<div class="stat-sensor" id="sensor-cpu">
<span class="stat-sensor-val {{ st.get('cpu_c') | temp_class }}" id="sensor-cpu-val">{{ st.get('cpu_c') }}°</span>
<span class="stat-sensor-label">CPU</span>
</div>
{%- endif %}
{%- if st.get('pch_c') is not none %}
<div class="stat-sensor" id="sensor-pch">
<span class="stat-sensor-val {{ st.get('pch_c') | temp_class }}" id="sensor-pch-val">{{ st.get('pch_c') }}°</span>
<span class="stat-sensor-label">PCH</span>
</div>
{%- endif %}
{%- endif %}
{%- set tp = poller.thermal_pressure if poller else 'ok' %}
<div class="stat-sensor stat-sensor-thermal stat-sensor-thermal-{{ tp }}"
id="sensor-thermal"
{% if not tp or tp == 'ok' %}hidden{% endif %}>
<span class="stat-sensor-val" id="sensor-thermal-val">
{%- if tp == 'warn' %}WARM{%- elif tp == 'crit' %}HOT{%- else %}OK{%- endif %}
</span>
<span class="stat-sensor-label">Thermal</span>
</div>
</div> </div>
<!-- Failed drive banner — shown/hidden by JS when failed count > 0 --> <!-- Failed drive banner — shown/hidden by JS when failed count > 0 -->
@@ -83,7 +111,6 @@
<button class="drawer-tab active" data-tab="burnin">Burn-In</button> <button class="drawer-tab active" data-tab="burnin">Burn-In</button>
<button class="drawer-tab" data-tab="smart">SMART</button> <button class="drawer-tab" data-tab="smart">SMART</button>
<button class="drawer-tab" data-tab="events">Events</button> <button class="drawer-tab" data-tab="events">Events</button>
<button class="drawer-tab" data-tab="terminal">Terminal</button>
</nav> </nav>
<div class="drawer-controls"> <div class="drawer-controls">
<label class="autoscroll-label"> <label class="autoscroll-label">
@@ -97,7 +124,6 @@
<div class="drawer-panel active" id="drawer-panel-burnin"></div> <div class="drawer-panel active" id="drawer-panel-burnin"></div>
<div class="drawer-panel" id="drawer-panel-smart"></div> <div class="drawer-panel" id="drawer-panel-smart"></div>
<div class="drawer-panel" id="drawer-panel-events"></div> <div class="drawer-panel" id="drawer-panel-events"></div>
<div class="drawer-panel drawer-panel-terminal" id="drawer-panel-terminal"></div>
</div> </div>
</div> </div>
{% endblock %} {% endblock %}


@@ -248,6 +248,30 @@
type="number" min="0" max="9999" value="{{ editable.bad_block_threshold }}"> type="number" min="0" max="9999" value="{{ editable.bad_block_threshold }}">
<span class="sf-hint">Max bad blocks before surface validate fails (Stage 7)</span> <span class="sf-hint">Max bad blocks before surface validate fails (Stage 7)</span>
</div> </div>
<div class="sf-row">
<label class="sf-label" for="surface_validate_block_size">Badblocks Block Size (bytes)</label>
<input class="sf-input sf-input-xs" id="surface_validate_block_size"
name="surface_validate_block_size" type="number" min="512" max="1048576" step="512"
value="{{ editable.surface_validate_block_size }}">
<span class="sf-hint">badblocks -b. 4096 (default) is conservative; 8192 is faster on multi-TB HDDs (~2x RAM, ~half the runtime). Power of 2.</span>
</div>
<div class="sf-row">
<label class="sf-label" for="surface_validate_block_buffer">Badblocks Block Buffer</label>
<input class="sf-input sf-input-xs" id="surface_validate_block_buffer"
name="surface_validate_block_buffer" type="number" min="1" max="4096"
value="{{ editable.surface_validate_block_buffer }}">
<span class="sf-hint">badblocks -c. 64 (default) matches the upstream tool. Buffer = block_size × this many blocks per IO.</span>
</div>
<div class="sf-row">
<label class="sf-label" for="surface_validate_passes">Badblocks Passes</label>
<input class="sf-input sf-input-xs" id="surface_validate_passes"
name="surface_validate_passes" type="number" min="0" max="16"
value="{{ editable.surface_validate_passes }}">
<span class="sf-hint">badblocks -p. 1 = repeat until one consecutive clean scan (default). 2-3 for paranoid burn-in that re-confirms after errors.</span>
</div>
</div>
</div><!-- /right col -->
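Review note: the block-size and block-buffer hints describe a simple relationship, where the per-I/O buffer is the block size multiplied by the block count. The sketch below works through that arithmetic and adds a power-of-two check. `badblocks_plan` and the placeholder device `/dev/sdX` are hypothetical names that do not appear in this change, and only the `-b`, `-c`, and `-p` flags named in the hints are used.

```python
def badblocks_plan(block_size=4096, block_buffer=64, passes=1, device="/dev/sdX"):
    """Return (bytes per I/O, argv) for a badblocks run.

    Hypothetical helper for illustration; the defaults mirror the values
    quoted in the form hints, and the device path is a placeholder.
    """
    # A power of two has exactly one bit set, so n & (n - 1) is zero.
    if block_size < 512 or block_size & (block_size - 1):
        raise ValueError("block_size must be a power of two >= 512")
    if block_buffer < 1:
        raise ValueError("block_buffer must be at least 1")
    if passes < 0:
        raise ValueError("passes must not be negative")

    io_bytes = block_size * block_buffer
    argv = [
        "badblocks",
        "-b", str(block_size),
        "-c", str(block_buffer),
        "-p", str(passes),
        device,
    ]
    return io_bytes, argv


default_bytes, default_argv = badblocks_plan()
large_bytes, _ = badblocks_plan(block_size=8192)
print(default_bytes // 1024, "KiB per I/O at the defaults")
print(large_bytes // 1024, "KiB per I/O with 8192-byte blocks")
```

The number input uses `step="512"`, which accepts values such as 1536 that are not powers of two, so a check like this one may be worth having wherever the setting is saved.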

0
tests/__init__.py Normal file

283
tests/test_pool_parser.py Normal file

@@ -0,0 +1,283 @@
"""Unit tests for the zpool-list, lsblk, and SMART-health parsers in ssh_client.

These cover the structural cases that drive the pool-membership lock:
mirror/raidz/draid container vdevs, single-disk vdevs at depth 1, the
flattened-indentation behaviour of `zpool list -vHP` on TrueNAS, partition
suffix stripping for NVMe and SCSI, and the cache/log/spare/special/dedup
section markers (including plural variants).

Run with: python -m unittest discover tests/ -v
"""
import unittest

from app.ssh_client import (
    _parse_zpool_list_output,
    _parse_lsblk_zfs_output,
    _parse_smart_health_batch,
)


class TestParseZpoolList(unittest.TestCase):

    def test_empty_output_returns_empty(self):
        self.assertEqual(_parse_zpool_list_output(""), {})

    def test_single_pool_with_mirror(self):
        # TrueNAS-flattened output: pool at depth 0, vdev type and devices
        # all at depth 1.
        out = _parse_zpool_list_output(
            "boot-pool\t232G\t8.4G\t224G\t-\t-\t17%\t3%\t1.00x\tONLINE\t-\n"
            "\tmirror-0\t232G\t8.4G\t224G\t-\t-\t17%\t3.6%\t-\tONLINE\n"
            "\t/dev/nvme0n1p3\t232G\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
            "\t/dev/sdd3\t232G\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
        )
        self.assertEqual(out, {
            "nvme0n1": {"pool": "boot-pool", "role": "data"},
            "sdd": {"pool": "boot-pool", "role": "data"},
        })

    def test_raidz2_pool(self):
        out = _parse_zpool_list_output(
            "tank\t127T\t4.5T\t122T\t-\t-\t0%\t3%\t1.00x\tONLINE\t-\n"
            "\traidz2-0\t127T\t4.5T\t122T\t-\t-\t0%\t3%\t-\tONLINE\n"
            "\t/dev/sdc\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
            "\t/dev/sde\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
            "\t/dev/sdf\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
        )
        self.assertEqual(set(out.keys()), {"sdc", "sde", "sdf"})
        for v in out.values():
            self.assertEqual(v, {"pool": "tank", "role": "data"})

    def test_draid_pool(self):
        out = _parse_zpool_list_output(
            "warm\t100T\t-\t-\t-\t-\t-\t-\t-\tONLINE\t-\n"
            "\tdraid2:8d:10c:1s-0\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
            "\t/dev/sdg\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
            "\t/dev/sdh\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
        )
        self.assertEqual(out["sdg"], {"pool": "warm", "role": "data"})
        self.assertEqual(out["sdh"], {"pool": "warm", "role": "data"})

    def test_single_disk_vdev_at_depth_1(self):
        # No mirror/raidz wrapper — a `/dev/...` line itself sits at depth 1.
        out = _parse_zpool_list_output(
            "scratch\t1T\t-\t-\t-\t-\t-\t-\t-\tONLINE\t-\n"
            "\t/dev/sdi\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
        )
        self.assertEqual(out, {"sdi": {"pool": "scratch", "role": "data"}})

    def test_section_markers_switch_role(self):
        # cache / log / spare / special / dedup all at depth 1; subsequent
        # /dev/... lines (also at depth 1) inherit that role.
        out = _parse_zpool_list_output(
            "tank\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\t-\n"
            "\tmirror-0\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
            "\t/dev/sda\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
            "\t/dev/sdb\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
            "\tcache\n"
            "\t/dev/nvme1n1\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
            "\tlog\n"
            "\t/dev/nvme2n1\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
            "\tspare\n"
            "\t/dev/sdz\t-\t-\t-\t-\t-\t-\t-\t-\tAVAIL\n"
        )
        self.assertEqual(out["sda"], {"pool": "tank", "role": "data"})
        self.assertEqual(out["sdb"], {"pool": "tank", "role": "data"})
        self.assertEqual(out["nvme1n1"], {"pool": "tank", "role": "cache"})
        self.assertEqual(out["nvme2n1"], {"pool": "tank", "role": "log"})
        self.assertEqual(out["sdz"], {"pool": "tank", "role": "spare"})

    def test_section_markers_plurals_normalize(self):
        # ZFS sometimes emits 'logs'/'spares' instead of 'log'/'spare'.
        out = _parse_zpool_list_output(
            "tank\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\t-\n"
            "\tlogs\n"
            "\t/dev/nvme0n1\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
            "\tspares\n"
            "\t/dev/sdz\t-\t-\t-\t-\t-\t-\t-\t-\tAVAIL\n"
        )
        self.assertEqual(out["nvme0n1"]["role"], "log")
        self.assertEqual(out["sdz"]["role"], "spare")

    def test_special_and_dedup_section(self):
        out = _parse_zpool_list_output(
            "tank\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\t-\n"
            "\tspecial\n"
            "\t/dev/sda\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
            "\tdedup\n"
            "\t/dev/sdb\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
        )
        self.assertEqual(out["sda"]["role"], "special")
        self.assertEqual(out["sdb"]["role"], "dedup")

    def test_partition_suffix_stripped(self):
        out = _parse_zpool_list_output(
            "tank\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\t-\n"
            "\tmirror-0\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
            "\t/dev/sda3\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
            "\t/dev/nvme0n1p3\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
        )
        self.assertIn("sda", out)
        self.assertNotIn("sda3", out)
        self.assertIn("nvme0n1", out)
        self.assertNotIn("nvme0n1p3", out)

    def test_long_scsi_devname(self):
        # Past sdz: sdaa, sdab, ...
        out = _parse_zpool_list_output(
            "big\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\t-\n"
            "\traidz3-0\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
            "\t/dev/sdaa\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
            "\t/dev/sdab1\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
        )
        self.assertEqual(out["sdaa"]["pool"], "big")
        self.assertEqual(out["sdab"]["pool"], "big")  # partition stripped

    def test_pool_name_with_dashes_dots_underscores(self):
        out = _parse_zpool_list_output(
            "my-cool_pool.v2\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\t-\n"
            "\t/dev/sda\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
        )
        self.assertEqual(out["sda"]["pool"], "my-cool_pool.v2")

    def test_multiple_pools(self):
        out = _parse_zpool_list_output(
            "boot-pool\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\t-\n"
            "\tmirror-0\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
            "\t/dev/nvme0n1p3\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
            "\t/dev/sdd3\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
            "tank\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\t-\n"
            "\traidz2-0\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
            "\t/dev/sda\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
            "\t/dev/sdb\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
        )
        self.assertEqual(out["nvme0n1"]["pool"], "boot-pool")
        self.assertEqual(out["sdd"]["pool"], "boot-pool")
        self.assertEqual(out["sda"]["pool"], "tank")
        self.assertEqual(out["sdb"]["pool"], "tank")

    def test_pool_role_resets_between_pools(self):
        # Section marker in pool A must not carry into pool B.
        out = _parse_zpool_list_output(
            "a\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\t-\n"
            "\tcache\n"
            "\t/dev/sda\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
            "b\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\t-\n"
            "\t/dev/sdb\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
        )
        self.assertEqual(out["sda"]["role"], "cache")
        self.assertEqual(out["sdb"]["role"], "data")

    def test_blank_lines_skipped(self):
        out = _parse_zpool_list_output(
            "\n"
            "tank\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\t-\n"
            "\n"
            "\t/dev/sda\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
        )
        self.assertEqual(out, {"sda": {"pool": "tank", "role": "data"}})
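Review note: the cases in `TestParseZpoolList` pin down a fairly small state machine. A line with no leading tab starts a new pool and resets the role to `data`, a bare section marker switches the role, and a `/dev/...` line records a member under the current pool and role. The following is a minimal sketch of a parser with that behaviour, written only from what the tests assert. It is not the code in `app/ssh_client.py`, and the names `base_device` and `sketch_parse_zpool_list` are hypothetical.

```python
import re

# Section markers, including the plural variants, mapped to a role.
_SECTION_ROLES = {
    "cache": "cache",
    "log": "log", "logs": "log",
    "spare": "spare", "spares": "spare",
    "special": "special",
    "dedup": "dedup",
}
_NVME = re.compile(r"^(nvme\d+n\d+)(?:p\d+)?$")   # nvme0n1p3 -> nvme0n1
_SCSI = re.compile(r"^(sd[a-z]+)\d*$")            # sdab1 -> sdab


def base_device(path):
    """Drop the directory part and any partition suffix from a device path."""
    name = path.rsplit("/", 1)[-1]
    for pattern in (_NVME, _SCSI):
        match = pattern.match(name)
        if match:
            return match.group(1)
    return name


def sketch_parse_zpool_list(text):
    """Map whole-disk device names to {'pool': ..., 'role': ...}."""
    members = {}
    pool = None
    role = "data"
    for line in text.splitlines():
        if not line.strip():
            continue                      # blank lines carry no information
        fields = line.split("\t")
        if fields[0]:                     # no leading tab: a pool line
            pool, role = fields[0], "data"
            continue
        name = fields[1] if len(fields) > 1 else ""
        if name in _SECTION_ROLES:
            role = _SECTION_ROLES[name]
        elif name.startswith("/dev/") and pool is not None:
            members[base_device(name)] = {"pool": pool, "role": role}
        # Anything else (mirror-0, raidz2-0, draid...) is a container vdev.
    return members
```

Because container vdevs and their children sit at the same depth in the flattened output, the sketch ignores indentation beyond "leading tab or not" and relies on the `/dev/` prefix to tell devices from containers.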


class TestParseLsblkZfs(unittest.TestCase):

    def test_empty_returns_empty_set(self):
        self.assertEqual(_parse_lsblk_zfs_output(""), set())

    def test_partition_zfs_member(self):
        # Typical TrueNAS layout: zpool members are partitions.
        out = _parse_lsblk_zfs_output(
            "sda \n"
            "sda1 \n"
            "sda3 zfs_member\n"
            "sdb \n"
            "sdb3 zfs_member\n"
        )
        self.assertEqual(out, {"sda", "sdb"})

    def test_whole_disk_zfs_member(self):
        # Some configurations put zfs_member on the whole disk.
        out = _parse_lsblk_zfs_output(
            "sdc zfs_member\n"
        )
        self.assertEqual(out, {"sdc"})

    def test_nvme_partitioned_and_whole(self):
        out = _parse_lsblk_zfs_output(
            "nvme0n1 \n"
            "nvme0n1p3 zfs_member\n"
            "nvme1n1 zfs_member\n"
        )
        self.assertEqual(out, {"nvme0n1", "nvme1n1"})

    def test_non_zfs_fstypes_ignored(self):
        out = _parse_lsblk_zfs_output(
            "sda1 ext4\n"
            "sda2 swap\n"
            "sdb1 btrfs\n"
        )
        self.assertEqual(out, set())

    def test_long_scsi_devnames(self):
        out = _parse_lsblk_zfs_output(
            "sdaa zfs_member\n"
            "sdab1 zfs_member\n"
        )
        self.assertEqual(out, {"sdaa", "sdab"})

    def test_short_lines_skipped(self):
        out = _parse_lsblk_zfs_output(
            "sda\n"
            "\n"
            "sdb1 zfs_member\n"
        )
        self.assertEqual(out, {"sdb"})


class TestParseSmartHealthBatch(unittest.TestCase):

    def test_passed_drive(self):
        out = _parse_smart_health_batch(
            "@@sda@@\n"
            "smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6]\n"
            "SMART overall-health self-assessment test result: PASSED\n"
            "@@END@@\n"
        )
        self.assertEqual(out, {"sda": "PASSED"})

    def test_failed_drive(self):
        out = _parse_smart_health_batch(
            "@@sdb@@\n"
            "SMART overall-health self-assessment test result: FAILED!\n"
            "@@END@@\n"
        )
        self.assertEqual(out, {"sdb": "FAILED"})

    def test_unknown_when_no_marker(self):
        out = _parse_smart_health_batch(
            "@@sdc@@\n"
            "/dev/sdc: Unknown USB bridge\n"
            "@@END@@\n"
        )
        self.assertEqual(out, {"sdc": "UNKNOWN"})

    def test_multiple_drives_mixed_states(self):
        out = _parse_smart_health_batch(
            "@@sda@@\n"
            "SMART overall-health self-assessment test result: PASSED\n"
            "@@END@@\n"
            "@@sdb@@\n"
            "SMART overall-health self-assessment test result: FAILED!\n"
            "@@END@@\n"
            "@@nvme0n1@@\n"
            "SMART overall-health self-assessment test result: PASSED\n"
            "@@END@@\n"
        )
        self.assertEqual(out, {"sda": "PASSED", "sdb": "FAILED", "nvme0n1": "PASSED"})

    def test_empty_returns_empty(self):
        self.assertEqual(_parse_smart_health_batch(""), {})
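Review note: the batch format exercised by `TestParseSmartHealthBatch` frames each drive's smartctl output between an `@@<devname>@@` line and an `@@END@@` line. One way to parse that framing is sketched below, assuming a drive defaults to `UNKNOWN` until a health verdict line is seen. The function name is hypothetical and this is not the implementation under test.

```python
import re

_FRAME = re.compile(r"^@@(.+)@@$")
_VERDICT_PREFIX = "SMART overall-health self-assessment test result:"


def sketch_parse_smart_batch(text):
    """Return {devname: 'PASSED' | 'FAILED' | 'UNKNOWN'} for a framed batch."""
    results = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        frame = _FRAME.match(line)
        if frame:
            name = frame.group(1)
            if name == "END":
                current = None            # close the current drive's section
            else:
                current = name
                results[current] = "UNKNOWN"   # until a verdict line says otherwise
            continue
        if current is not None and line.startswith(_VERDICT_PREFIX):
            verdict = line[len(_VERDICT_PREFIX):].strip()
            if verdict.startswith("PASSED"):
                results[current] = "PASSED"
            elif verdict.startswith("FAILED"):   # smartctl prints "FAILED!"
                results[current] = "FAILED"
    return results
```

Defaulting to `UNKNOWN` means a drive whose section contains only an error message still appears in the result, which matches the USB-bridge case in the tests.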


if __name__ == "__main__":
    unittest.main()

303
tests/test_unlock_flow.py Normal file

@@ -0,0 +1,303 @@
"""Unit tests for the pool-drive unlock state machine in burnin.py.

Covers: token validation per pool kind, identity-binding (grant
invalidated when pool_name/pool_role changes), TTL expiry, the
audit-commit-then-arm ordering (a failing audit insert leaves no
in-memory grant), and the unique-active-burnin partial index that
prevents duplicate queued rows for the same drive.

Uses a temporary SQLite file and points app.config.settings.db_path at it.
No SSH, no network, no FastAPI.

Run with: python -m unittest discover tests/ -v
"""
import os
import tempfile
import time
import unittest

import aiosqlite


async def _setup_temp_db() -> str:
    """Create a temp SQLite file, point app.config at it, init schema.

    Async-callable from IsolatedAsyncioTestCase.asyncSetUp."""
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    from app.config import settings
    settings.db_path = path
    from app.database import init_db
    await init_db()
    # Seed pool drives so unlock_flow tests have something to grant on.
    async with aiosqlite.connect(path) as db:
        await db.execute("""
            INSERT INTO drives
                (truenas_disk_id, devname, serial, model, size_bytes,
                 temperature_c, smart_health, last_seen_at, last_polled_at,
                 pool_name, pool_role, pool_seen_at)
            VALUES ('test-id-1', 'sda', 'TESTSER1', 'TestModel', 1000,
                    30, 'PASSED', '2026-05-02T00:00:00+00:00',
                    '2026-05-02T00:00:00+00:00',
                    'tank', 'data', '2026-05-02T00:00:00+00:00')
        """)
        await db.execute("""
            INSERT INTO drives
                (truenas_disk_id, devname, serial, model, size_bytes,
                 temperature_c, smart_health, last_seen_at, last_polled_at,
                 pool_name, pool_role, pool_seen_at)
            VALUES ('test-id-2', 'sdb', 'TESTSER2', 'TestModel', 1000,
                    30, 'PASSED', '2026-05-02T00:00:00+00:00',
                    '2026-05-02T00:00:00+00:00',
                    'boot-pool', 'data', '2026-05-02T00:00:00+00:00')
        """)
        await db.execute("""
            INSERT INTO drives
                (truenas_disk_id, devname, serial, model, size_bytes,
                 temperature_c, smart_health, last_seen_at, last_polled_at,
                 pool_name, pool_role, pool_seen_at)
            VALUES ('test-id-3', 'sdc', 'TESTSER3', 'TestModel', 1000,
                    30, 'PASSED', '2026-05-02T00:00:00+00:00',
                    '2026-05-02T00:00:00+00:00',
                    '(exported)', 'exported', '2026-05-02T00:00:00+00:00')
        """)
        await db.commit()
    return path


class TestUnlockFlow(unittest.IsolatedAsyncioTestCase):

    async def asyncSetUp(self):
        self.db_path = await _setup_temp_db()
        # Reset module state so previous test runs don't bleed in.
        from app import burnin
        burnin._unlock_grants.clear()

    async def asyncTearDown(self):
        try:
            os.unlink(self.db_path)
        except OSError:
            pass

    # ----- token validation per pool kind -----

    async def test_active_pool_token_is_pool_name(self):
        from app import burnin
        # Drive 1 = tank/data
        with self.assertRaises(ValueError):
            await burnin.grant_pool_unlock(1, "wrong", "op", "valid reason")
        expiry = await burnin.grant_pool_unlock(1, "tank", "op", "valid reason")
        self.assertGreater(expiry, time.time())

    async def test_boot_pool_token_is_destroy_phrase(self):
        from app import burnin
        # Drive 2 = boot-pool — typing the pool name must NOT work.
        with self.assertRaises(ValueError):
            await burnin.grant_pool_unlock(2, "boot-pool", "op", "valid reason")
        expiry = await burnin.grant_pool_unlock(
            2, "DESTROY BOOT POOL", "op", "valid reason"
        )
        self.assertGreater(expiry, time.time())

    async def test_exported_token_is_destroy_phrase(self):
        from app import burnin
        # Drive 3 = (exported)/exported
        with self.assertRaises(ValueError):
            await burnin.grant_pool_unlock(3, "(exported)", "op", "valid reason")
        expiry = await burnin.grant_pool_unlock(
            3, "DESTROY EXPORTED POOL", "op", "valid reason"
        )
        self.assertGreater(expiry, time.time())

    # ----- input validation -----

    async def test_empty_reason_rejected(self):
        from app import burnin
        with self.assertRaises(ValueError):
            await burnin.grant_pool_unlock(1, "tank", "op", "")

    async def test_short_reason_rejected(self):
        from app import burnin
        with self.assertRaises(ValueError):
            await burnin.grant_pool_unlock(1, "tank", "op", "hi")

    async def test_empty_operator_rejected(self):
        from app import burnin
        with self.assertRaises(ValueError):
            await burnin.grant_pool_unlock(1, "tank", "", "valid reason")

    async def test_unknown_drive_rejected(self):
        from app import burnin
        with self.assertRaises(ValueError):
            await burnin.grant_pool_unlock(99999, "anything", "op", "valid reason")

    async def test_drive_not_in_pool_rejected(self):
        from app import burnin
        # Manually clear pool fields on drive 1
        async with aiosqlite.connect(self.db_path) as db:
            await db.execute("UPDATE drives SET pool_name=NULL, pool_role=NULL WHERE id=1")
            await db.commit()
        with self.assertRaises(ValueError):
            await burnin.grant_pool_unlock(1, "tank", "op", "valid reason")

    # ----- identity binding (Codex finding #2) -----

    async def test_grant_invalidated_when_pool_name_changes(self):
        from app import burnin
        await burnin.grant_pool_unlock(1, "tank", "op", "valid reason")
        # Operator's grant references tank/data; pool detection now reports tank2.
        self.assertTrue(burnin._is_unlocked(1, "tank", "data"))
        self.assertFalse(burnin._is_unlocked(1, "tank2", "data"))
        # And the side effect: the grant is reaped, not just temporarily denied.
        self.assertNotIn(1, burnin._unlock_grants)

    async def test_grant_invalidated_when_pool_role_changes(self):
        from app import burnin
        await burnin.grant_pool_unlock(1, "tank", "op", "valid reason")
        # Same pool, different role (data -> cache).
        self.assertFalse(burnin._is_unlocked(1, "tank", "cache"))
        self.assertNotIn(1, burnin._unlock_grants)

    async def test_unlock_expiry_returns_none_for_mismatched_identity(self):
        from app import burnin
        await burnin.grant_pool_unlock(1, "tank", "op", "valid reason")
        self.assertIsNotNone(burnin.unlock_expiry(1, "tank", "data"))
        self.assertIsNone(burnin.unlock_expiry(1, "tank2", "data"))

    # ----- TTL expiry -----

    async def test_expired_grant_returns_false(self):
        from app import burnin
        # Drop TTL to 0 so the grant is born expired.
        original = burnin.UNLOCK_TTL_SECONDS
        burnin.UNLOCK_TTL_SECONDS = 0
        try:
            await burnin.grant_pool_unlock(1, "tank", "op", "valid reason")
            self.assertFalse(burnin._is_unlocked(1, "tank", "data"))
            self.assertNotIn(1, burnin._unlock_grants)
        finally:
            burnin.UNLOCK_TTL_SECONDS = original

    # ----- audit commit ordering (Codex finding #3) -----

    async def test_audit_event_recorded_for_active_pool(self):
        from app import burnin
        await burnin.grant_pool_unlock(1, "tank", "alice", "swapping out drive")
        async with aiosqlite.connect(self.db_path) as db:
            db.row_factory = aiosqlite.Row
            cur = await db.execute(
                "SELECT event_type, operator, message FROM audit_events "
                "WHERE drive_id=? ORDER BY id DESC LIMIT 1", (1,)
            )
            row = await cur.fetchone()
        self.assertEqual(row["event_type"], "pool_drive_unlocked")
        self.assertEqual(row["operator"], "alice")
        self.assertIn("swapping out drive", row["message"])

    async def test_audit_event_for_boot_pool_uses_distinct_type(self):
        from app import burnin
        await burnin.grant_pool_unlock(
            2, "DESTROY BOOT POOL", "alice", "replacing failed mirror"
        )
        async with aiosqlite.connect(self.db_path) as db:
            db.row_factory = aiosqlite.Row
            cur = await db.execute(
                "SELECT event_type FROM audit_events WHERE drive_id=? ORDER BY id DESC LIMIT 1",
                (2,),
            )
            row = await cur.fetchone()
        self.assertEqual(row["event_type"], "boot_pool_drive_unlocked")

    async def test_audit_event_for_exported_uses_distinct_type(self):
        from app import burnin
        await burnin.grant_pool_unlock(
            3, "DESTROY EXPORTED POOL", "alice", "decommissioned pool"
        )
        async with aiosqlite.connect(self.db_path) as db:
            db.row_factory = aiosqlite.Row
            cur = await db.execute(
                "SELECT event_type FROM audit_events WHERE drive_id=? ORDER BY id DESC LIMIT 1",
                (3,),
            )
            row = await cur.fetchone()
        self.assertEqual(row["event_type"], "exported_pool_drive_unlocked")

    async def test_failed_token_does_not_record_audit_event(self):
        from app import burnin
        try:
            await burnin.grant_pool_unlock(1, "wrong-token", "op", "valid reason")
        except ValueError:
            pass
        async with aiosqlite.connect(self.db_path) as db:
            cur = await db.execute(
                "SELECT COUNT(*) FROM audit_events WHERE drive_id=?", (1,)
            )
            self.assertEqual((await cur.fetchone())[0], 0)
        # And no in-memory grant was armed.
        self.assertNotIn(1, burnin._unlock_grants)


class TestActiveJobUniqueIndex(unittest.IsolatedAsyncioTestCase):
    """Codex finding #4 — the partial unique index on burnin_jobs(drive_id)
    WHERE state IN ('queued','running') must reject a second active row even
    when two requests pass the SELECT-COUNT check concurrently."""

    async def asyncSetUp(self):
        self.db_path = await _setup_temp_db()
        from app import burnin
        burnin._unlock_grants.clear()
        # Need to clear the pool field on drive 1 so unlock isn't required
        # for these race tests.
        async with aiosqlite.connect(self.db_path) as db:
            await db.execute("UPDATE drives SET pool_name=NULL, pool_role=NULL WHERE id=1")
            await db.commit()
        # Burnin orchestrator init for the semaphore
        import asyncio
        burnin._semaphore = asyncio.Semaphore(4)

    async def asyncTearDown(self):
        try:
            os.unlink(self.db_path)
        except OSError:
            pass

    async def test_index_blocks_second_active_insert(self):
        # Insert a queued row by hand, then try a second one — index fires.
        async with aiosqlite.connect(self.db_path) as db:
            await db.execute(
                """INSERT INTO burnin_jobs (drive_id, profile, state, percent, operator, created_at)
                   VALUES (?,?,?,?,?,?)""",
                (1, "surface", "queued", 0, "op", "2026-05-02T00:00:00+00:00"),
            )
            await db.commit()
            with self.assertRaises(aiosqlite.IntegrityError):
                await db.execute(
                    """INSERT INTO burnin_jobs (drive_id, profile, state, percent, operator, created_at)
                       VALUES (?,?,?,?,?,?)""",
                    (1, "surface", "queued", 0, "op", "2026-05-02T00:00:01+00:00"),
                )
                await db.commit()

    async def test_index_allows_terminal_state_then_new_job(self):
        # passed/failed/cancelled/unknown rows must not block a fresh queue.
        async with aiosqlite.connect(self.db_path) as db:
            for state in ("passed", "failed", "cancelled", "unknown"):
                await db.execute(
                    """INSERT INTO burnin_jobs (drive_id, profile, state, percent, operator, created_at)
                       VALUES (?,?,?,?,?,?)""",
                    (1, "surface", state, 100, "op", "2026-05-02T00:00:00+00:00"),
                )
            await db.commit()
            # Should succeed — no other queued/running row exists.
            await db.execute(
                """INSERT INTO burnin_jobs (drive_id, profile, state, percent, operator, created_at)
                   VALUES (?,?,?,?,?,?)""",
                (1, "surface", "queued", 0, "op", "2026-05-02T00:00:00+00:00"),
            )
            await db.commit()


if __name__ == "__main__":
    unittest.main()
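Review note: the partial unique index that `TestActiveJobUniqueIndex` relies on can be reproduced in isolation with the standard-library `sqlite3` module. The sketch below uses a cut-down, hypothetical `jobs` table and index name rather than the real `burnin_jobs` schema, which is not shown in this diff. Only the `WHERE state IN ('queued','running')` predicate is taken from the test docstring.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE jobs (
        id       INTEGER PRIMARY KEY,
        drive_id INTEGER NOT NULL,
        state    TEXT NOT NULL
    );
    -- Uniqueness applies only to rows matching the WHERE clause, so any
    -- number of finished jobs may share a drive_id.
    CREATE UNIQUE INDEX one_active_job_per_drive
        ON jobs(drive_id)
        WHERE state IN ('queued', 'running');
""")


def try_insert(drive_id, state):
    """Insert a job row; report whether the database accepted it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO jobs (drive_id, state) VALUES (?, ?)",
                (drive_id, state),
            )
        return True
    except sqlite3.IntegrityError:
        return False


results = [
    try_insert(1, "passed"),    # terminal row, outside the index
    try_insert(1, "queued"),    # first active row for drive 1
    try_insert(1, "running"),   # second active row for drive 1
    try_insert(2, "queued"),    # a different drive is unaffected
]
print(results)
```

Because the constraint lives in the database, it still holds when two requests both pass an application-level count check before either inserts.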