feat: pool-membership lock + cancellation hardening + smart_health refresh + tunables (1.0.0-13 -> 1.0.0-21)
Substantial feature + reliability sweep. Each version below was developed,
tested live against the maple/TrueNAS deployment, and Codex-reviewed
before bundling.
1.0.0-13 — asyncssh proc.kill() doesn't actually kill the remote process
(sshd ignores SSH signal-channel requests by default), so a cancel of a
long-running badblocks left the remote process running and proc.wait()
hanging — pinning the asyncio.Semaphore slot forever.
* Wrap long-lived commands in `sh -c 'echo PID:$$; exec <cmd>'` to
capture the remote PID; store in burnin._remote_pids[job_id].
* burnin._kill_remote_process(job_id) opens a fresh SSH session and
issues `kill -9 <pid>` — sshd honours that.
* Bound proc.wait() with asyncio.wait_for(timeout=15).
* burnin._active_tasks tracks every _run_job task so cancel_job and
check_stuck_jobs can actually cancel the asyncio task (was DB-only
before). Also fixes the documented asyncio.create_task GC gotcha
(weak refs only).
* _run_job finalizer reads current state and skips the write if state
!= 'running' so cancelled/unknown aren't clobbered.
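
The PID-capture wrapper described above can be sketched as two small helpers. This is a minimal illustration, not the code in burnin.py: `wrap_with_pid` and `parse_pid_line` are hypothetical names, and the sketch assumes the remote shell prints `PID:<n>` as the first stdout line before `exec` replaces it with the real command.

```python
import shlex
from typing import Optional


def wrap_with_pid(cmd: str) -> str:
    """Wrap cmd so the remote shell prints its own PID, then execs cmd.

    Because `exec` replaces the shell process, the printed PID is the PID
    of the long-running command itself.
    """
    inner = f"echo PID:$$; exec {cmd}"
    return f"sh -c {shlex.quote(inner)}"


def parse_pid_line(line: str) -> Optional[int]:
    """Return the PID from a 'PID:<n>' line, or None for any other line."""
    line = line.strip()
    if not line.startswith("PID:"):
        return None
    digits = line[len("PID:"):]
    return int(digits) if digits.isdigit() else None


wrapped = wrap_with_pid("badblocks -wsv /dev/sdX")
```

The parsed PID is what a later, separate SSH session would pass to `kill -9`; any stdout line that does not match is treated as ordinary command output.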
1.0.0-14 — poller._upsert_drive ON CONFLICT only refreshed temperature/
health/poll timestamps; devname/serial/model/size_bytes were stuck at
first-INSERT values forever. After kernel SCSI re-enumeration two
drives could both show as `sda`. Fixed by updating all six fields.
Also added 7-day stale filter to _DRIVES_QUERY so removed drives drop
off the dashboard while audit/burnin_jobs FKs stay intact.
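
A small stdlib sqlite3 sketch of the upsert behaviour described above. The table layout and the `disk_key` conflict column are assumptions made for illustration; the real schema and the exact column list in poller._upsert_drive are not shown in this message.

```python
import sqlite3

# Hypothetical, simplified schema: disk_key stands in for whatever stable
# identifier the real table uses as its conflict target.
SCHEMA = """
CREATE TABLE drives (
    id INTEGER PRIMARY KEY,
    disk_key TEXT NOT NULL UNIQUE,
    devname TEXT, serial TEXT, model TEXT, size_bytes INTEGER,
    temperature_c INTEGER
)"""

# Refresh the identity columns on conflict, not just the readings, so a
# drive that re-enumerates under a new devname does not keep its old one.
UPSERT = """
INSERT INTO drives (disk_key, devname, serial, model, size_bytes, temperature_c)
VALUES (:disk_key, :devname, :serial, :model, :size_bytes, :temperature_c)
ON CONFLICT(disk_key) DO UPDATE SET
    devname = excluded.devname,
    serial = excluded.serial,
    model = excluded.model,
    size_bytes = excluded.size_bytes,
    temperature_c = excluded.temperature_c
"""


def upsert_drive(db: sqlite3.Connection, **fields) -> None:
    db.execute(UPSERT, fields)


db = sqlite3.connect(":memory:")
db.execute(SCHEMA)
upsert_drive(db, disk_key="k1", devname="sda", serial="S1", model="M1",
             size_bytes=1000, temperature_c=30)
# Same physical drive after re-enumeration: new devname, new temperature.
upsert_drive(db, disk_key="k1", devname="sdb", serial="S1", model="M1",
             size_bytes=1000, temperature_c=31)
row = db.execute("SELECT devname, temperature_c FROM drives").fetchall()
```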
1.0.0-15/-16 — pool-membership lock.
* ssh_client.get_pool_membership() runs `zpool list -vHP` and parses
the flattened TrueNAS output (container vdevs + their device children
both appear at depth 1; section markers cache/log/spare/special/dedup
switch the role).
* ssh_client.get_zfs_member_drives() runs `lsblk -no NAME,FSTYPE -l`
to detect drives carrying ZFS labels not in any active pool — they
get pool_name='(exported)', pool_role='exported'.
* Three idempotent ALTER TABLE migrations on drives:
pool_name/pool_role/pool_seen_at.
* burnin.start_job raises PoolMemberError if pool_name IS NOT NULL and
the drive isn't in burnin._unlock_grants. Routes layer maps to 409
with structured detail {pool_name, pool_role, pool_locked: true} so
the frontend can render an unlock affordance.
* POST /api/v1/drives/{id}/unlock accepts {confirm_token, operator,
reason}. Token is the pool name for active pools, "DESTROY BOOT POOL"
for boot-pool, "DESTROY EXPORTED POOL" for exported. Reason >= 5
chars. TTL = UNLOCK_TTL_SECONDS = 600. Audit event types:
pool_drive_unlocked / boot_pool_drive_unlocked /
exported_pool_drive_unlocked.
* Grants are in-memory only — container restart wipes them.
* UI: lock icon (yellow/red/orange), pool pill, conditional Unlock vs
Burn-In button. modal_unlock.html with type-to-confirm field.
Live unlock countdown via tickUnlockCountdowns() in app.js.
* Daily report: red banner listing every unlock event from the last
24h, with operator + reason + timestamp.
1.0.0-17 — Codex review fail-open + XSS + structured-error fixes.
* ssh_client.get_pool_membership / get_zfs_member_drives now return
None on failure (vs {} for 'definitely empty'). poller passes
update_pool=False to _upsert_drive on detection failure, preserving
existing pool columns instead of clearing them. Without this fix a
1-second SSH blip silently unlocked every drive.
* mailer._build_unlock_banner_html escapes every interpolated field
via html.escape() (was '<' only). Time filter switched to
julianday() — string >= against datetime('now', '-1 day') compared
formats with different separators ('T' vs ' ') and timezone
suffixes, causing subtle off-by-N-hour inclusion.
* app.js submitStart/submitBatchStart now detect the structured
pool_locked 409 detail and auto-open the unlock modal for the
offending drive (was [object Object] in toast).
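
The time-filter problem from the mailer fix can be reproduced with stdlib sqlite3. This is a sketch under assumptions: the `created_at` column name and the fixed reference time are illustrative, and timestamps are taken to be stored in ISO 8601 form with a 'T' separator and a UTC offset.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE audit_events (id INTEGER PRIMARY KEY, created_at TEXT)")
db.executemany(
    "INSERT INTO audit_events (created_at) VALUES (?)",
    [
        ("2024-01-01T03:00:00+00:00",),  # 33 h before NOW: should be excluded
        ("2024-01-02T03:00:00+00:00",),  # 9 h before NOW: should be included
    ],
)

NOW = "2024-01-02 12:00:00"  # fixed reference time instead of 'now'

# Plain string comparison: 'T' sorts after ' ', so every event dated on the
# cutoff day passes the filter regardless of its time of day.
string_hits = db.execute(
    "SELECT COUNT(*) FROM audit_events "
    "WHERE created_at >= datetime(?, '-1 day')",
    (NOW,),
).fetchone()[0]

# julianday() parses both formats into numbers before comparing.
julian_hits = db.execute(
    "SELECT COUNT(*) FROM audit_events "
    "WHERE julianday(created_at) >= julianday(?, '-1 day')",
    (NOW,),
).fetchone()[0]
```

With these two rows the string comparison admits both events, while the julianday() comparison admits only the one that really falls inside the 24-hour window.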
1.0.0-18 — Codex grant-binding + commit-ordering fixes.
* Unlock grants bound to the (pool_name, pool_role) observed at unlock
time. _UnlockGrant dataclass; _is_unlocked and unlock_expiry
invalidate the grant if the live row's pool identity has changed.
Prevents an 'exported' unlock from carrying over when the drive
turns out to be in active 'tank' or 'boot-pool'.
* grant_pool_unlock now writes to _unlock_grants only AFTER db.commit()
succeeds — previously a failed audit insert left an unaudited grant
armed.
1.0.0-19 — Codex race + cancellation classification + test scaffold.
* Partial unique index uniq_active_burnin_per_drive ON burnin_jobs
(drive_id) WHERE state IN ('queued','running'). INSERT now wraps in
try/except aiosqlite.IntegrityError -> ValueError so the read-then-
insert race in start_job can't produce two queued rows for the same
drive.
* _run_job tracks was_cancelled flag; on bare task.cancel() (shutdown,
future code paths) where DB state is still 'running', finalizer
writes 'unknown' instead of mis-classifying as 'failed'.
* tests/ stdlib unittest scaffold:
- test_pool_parser.py (21 tests): mirror/raidz/draid container vdevs,
single-disk depth-1, plural section markers, partition stripping,
sdaa-style names, multi-pool, role reset between pools.
- test_unlock_flow.py (18 tests): token validation per pool kind,
identity-binding invalidation, TTL expiry, audit-commit-then-arm
ordering, unique-active-burnin partial index.
Run via `python -m unittest discover tests/`. No new dependencies.
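
How the partial unique index closes the read-then-insert race can be sketched with stdlib sqlite3. The table is reduced to the columns the index needs, and this synchronous `start_job` is a stand-in for illustration, not the async function in burnin.py.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE burnin_jobs ("
    "id INTEGER PRIMARY KEY, drive_id INTEGER NOT NULL, state TEXT NOT NULL)"
)
# Uniqueness applies only to rows that are still active, so finished jobs
# for the same drive can accumulate freely.
db.execute(
    "CREATE UNIQUE INDEX uniq_active_burnin_per_drive "
    "ON burnin_jobs (drive_id) WHERE state IN ('queued','running')"
)


def start_job(db: sqlite3.Connection, drive_id: int) -> int:
    """Insert a queued job; the index rejects a second active job."""
    try:
        cur = db.execute(
            "INSERT INTO burnin_jobs (drive_id, state) VALUES (?, 'queued')",
            (drive_id,),
        )
    except sqlite3.IntegrityError:
        raise ValueError("Drive already has an active burn-in job") from None
    return cur.lastrowid


first_id = start_job(db, 1)
try:
    start_job(db, 1)
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True

# Once the first job reaches a terminal state, a new one is allowed.
db.execute("UPDATE burnin_jobs SET state='passed' WHERE id=?", (first_id,))
second_id = start_job(db, 1)
```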
1.0.0-20 — Spearfoot-inspired badblocks tunables.
* surface_validate_block_size (-b, default 4096), surface_validate_
block_buffer (-c, default 64), surface_validate_passes (-p, default
1) exposed in Settings UI; persist via settings_store.json.
Validation: block size must be a power of 2 between 512 and
1048576. Defaults preserve existing behaviour. Bumping to 8192/64/1
roughly halves runtime on multi-TB HDDs at ~2x RAM cost.
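
The block-size rule stated above (a power of 2 between 512 and 1048576) reduces to a one-line check. `is_valid_block_size` is a hypothetical helper name; the actual validation in the settings layer may be structured differently.

```python
def is_valid_block_size(n: int) -> bool:
    """True if n is a power of two in the inclusive range 512..1048576."""
    # A positive integer is a power of two exactly when it has a single
    # set bit, i.e. n & (n - 1) == 0.
    return isinstance(n, int) and 512 <= n <= 1048576 and (n & (n - 1)) == 0
```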
1.0.0-21 — SMART overall-health column actually populated.
* /api/v2.0/disk doesn't expose smart_health, so every drive defaulted
to UNKNOWN forever (only burn-in stages ever wrote a real value).
* ssh_client.get_smart_health_map([devnames]) runs `smartctl -H` for
all drives in a single SSH session, deterministically delimited with
@@devname@@ ... @@END@@ markers. Returns {devname: PASSED|FAILED|
UNKNOWN} or None on SSH failure.
* poller calls it every 5th cycle (~1 min at default 12s interval),
caches in _state['smart_health_cache'] so transient failures preserve
the previous values.
* Dashboard CSS: col-smart min-width 150 -> 95, horizontal padding 14
-> 6 so Short/Long SMART columns fit comfortably on a 13-inch
display.
* 5 additional parser tests (44 total, all passing).
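
A rough sketch of parsing the marker-delimited output described above. It assumes each drive's block looks like `@@devname@@ <smartctl -H output> @@END@@` and that the health verdict appears on the ATA-style "test result:" line; the real parser in ssh_client may accept more formats. `parse_smart_health` and the sample text are illustrative only.

```python
import re
from typing import Dict

# One block per drive: @@<devname>@@ ... @@END@@ (the END marker itself is
# excluded from being read as a device name).
_BLOCK = re.compile(
    r"@@(?!END@@)(?P<dev>[^@\s]+)@@(?P<body>.*?)@@END@@", re.DOTALL
)
_RESULT = re.compile(r"test result:\s*(PASSED|FAILED)")


def parse_smart_health(output: str) -> Dict[str, str]:
    """Map each devname to PASSED, FAILED, or UNKNOWN."""
    health: Dict[str, str] = {}
    for block in _BLOCK.finditer(output):
        match = _RESULT.search(block.group("body"))
        health[block.group("dev")] = match.group(1) if match else "UNKNOWN"
    return health


SAMPLE = (
    "@@sda@@\nSMART overall-health self-assessment test result: PASSED\n@@END@@\n"
    "@@sdb@@\nSMART overall-health self-assessment test result: FAILED!\n@@END@@\n"
    "@@sdc@@\n(no health line in this block)\n@@END@@\n"
)
parsed = parse_smart_health(SAMPLE)
```

A block with no recognisable verdict maps to UNKNOWN rather than being dropped, so every requested drive still gets an entry.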
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
parent b85bac7686
commit 5da1a1704f
18 changed files with 2623 additions and 311 deletions
595
app/burnin.py
@@ -19,6 +19,7 @@ Cancellation:
 import asyncio
 import logging
 import time
+from contextlib import asynccontextmanager
 from datetime import datetime, timezone
 
 import aiosqlite
||||||
|
|
@ -66,14 +67,29 @@ POLL_INTERVAL = 5.0 # seconds between progress checks during active stages
|
||||||
_semaphore: asyncio.Semaphore | None = None
|
_semaphore: asyncio.Semaphore | None = None
|
||||||
_client: TrueNASClient | None = None
|
_client: TrueNASClient | None = None
|
||||||
|
|
||||||
|
# Live job tracking — keeps a strong reference to every _run_job task so it
|
||||||
|
# isn't garbage-collected (asyncio.create_task only keeps a weak ref) and so
|
||||||
|
# cancel_job / check_stuck_jobs can actually unwedge a stuck task.
|
||||||
|
_active_tasks: dict[int, "asyncio.Task"] = {}
|
||||||
|
|
||||||
|
# Remote PID of any long-running SSH child process (currently only badblocks)
|
||||||
|
# so we can kill it via a fresh SSH session — proc.kill() over asyncssh sends
|
||||||
|
# a "signal" channel request that OpenSSH sshd ignores by default, leaving
|
||||||
|
# the remote process running and proc.wait() hanging forever.
|
||||||
|
_remote_pids: dict[int, int] = {}
|
||||||
|
|
||||||
|
|
||||||
def _now() -> str:
|
def _now() -> str:
|
||||||
return datetime.now(timezone.utc).isoformat()
|
return datetime.now(timezone.utc).isoformat()
|
||||||
|
|
||||||
|
|
||||||
def _db():
|
@asynccontextmanager
|
||||||
"""Open a fresh WAL-mode connection. Caller must use 'async with'."""
|
async def _db():
|
||||||
return aiosqlite.connect(settings.db_path)
|
"""Open a WAL-mode connection with busy_timeout so writers wait for the lock
|
||||||
|
instead of immediately raising 'database is locked' under contention."""
|
||||||
|
async with aiosqlite.connect(settings.db_path) as db:
|
||||||
|
await db.execute("PRAGMA busy_timeout=10000")
|
||||||
|
yield db
|
||||||
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
# ---------------------------------------------------------------------------
|
||||||
|
|
@ -104,11 +120,228 @@ async def init(client: TrueNASClient) -> None:
|
||||||
await db.commit()
|
await db.commit()
|
||||||
|
|
||||||
for job_id in queued:
|
for job_id in queued:
|
||||||
asyncio.create_task(_run_job(job_id))
|
_spawn_run_job(job_id)
|
||||||
|
|
||||||
log.info("Burn-in orchestrator ready (max_concurrent=%d)", settings.max_parallel_burnins)
|
log.info("Burn-in orchestrator ready (max_concurrent=%d)", settings.max_parallel_burnins)
|
||||||
|
|
||||||
|
|
||||||
|
def _spawn_run_job(job_id: int) -> "asyncio.Task":
|
||||||
|
"""Schedule a _run_job task and keep a strong reference to it.
|
||||||
|
|
||||||
|
Plain asyncio.create_task() only leaves a weak reference behind, so the
|
||||||
|
task can be GC'd before it ever runs. Storing it in _active_tasks also
|
||||||
|
lets cancel_job / check_stuck_jobs cancel it directly.
|
||||||
|
"""
|
||||||
|
task = asyncio.create_task(_run_job(job_id))
|
||||||
|
_active_tasks[job_id] = task
|
||||||
|
|
||||||
|
def _cleanup(t: "asyncio.Task") -> None:
|
||||||
|
# Remove only if it's still us — avoid clobbering a re-enqueued task.
|
||||||
|
if _active_tasks.get(job_id) is t:
|
||||||
|
_active_tasks.pop(job_id, None)
|
||||||
|
_remote_pids.pop(job_id, None)
|
||||||
|
|
||||||
|
task.add_done_callback(_cleanup)
|
||||||
|
return task
|
||||||
|
|
||||||
|
|
||||||
|
async def _kill_remote_process(job_id: int) -> None:
|
||||||
|
"""Send kill -9 to the remote PID associated with this job, if any.
|
||||||
|
|
||||||
|
asyncssh's proc.kill() sends an SSH 'signal' channel request which
|
||||||
|
OpenSSH's sshd does not honor by default. Opening a fresh session and
|
||||||
|
running /bin/kill is the reliable way to actually terminate the process.
|
||||||
|
"""
|
||||||
|
pid = _remote_pids.pop(job_id, None)
|
||||||
|
if not pid:
|
||||||
|
return
|
||||||
|
try:
|
||||||
|
from app import ssh_client
|
||||||
|
async with await ssh_client._connect() as conn:
|
||||||
|
await asyncio.wait_for(
|
||||||
|
conn.run(f"kill -9 {pid} 2>/dev/null || true", check=False),
|
||||||
|
timeout=10,
|
||||||
|
)
|
||||||
|
log.info("Remote-killed PID %d for job %d", pid, job_id)
|
||||||
|
except Exception as exc:
|
||||||
|
log.warning("Failed to remote-kill PID %d for job %d: %s", pid, job_id, exc)
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Pool-drive unlock state
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
#
|
||||||
|
# Drives that ZFS reports as belonging to an active zpool (including the
|
||||||
|
# boot pool) are locked from burn-in until the operator explicitly unlocks
|
||||||
|
# them via POST /api/v1/drives/{id}/unlock. Grants live in memory only —
|
||||||
|
# a container restart wipes them, which is the right default for "this is
|
||||||
|
# very dangerous." TTL is bounded so an unlock you forgot about can't sit
|
||||||
|
# armed indefinitely.
|
||||||
|
|
||||||
|
import time as _time
|
||||||
|
from dataclasses import dataclass
|
||||||
|
|
||||||
|
UNLOCK_TTL_SECONDS = 600 # 10 minutes
|
||||||
|
BOOT_POOL_NAME = "boot-pool"
|
||||||
|
BOOT_POOL_CONFIRM_TOKEN = "DESTROY BOOT POOL"
|
||||||
|
EXPORTED_POOL_ROLE = "exported"
|
||||||
|
EXPORTED_CONFIRM_TOKEN = "DESTROY EXPORTED POOL"
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class _UnlockGrant:
|
||||||
|
"""An operator-issued, time-bounded permission to burn-in a pool drive.
|
||||||
|
|
||||||
|
The grant is BOUND to the (pool_name, pool_role) observed at unlock
|
||||||
|
time. If a subsequent poll reclassifies the drive — e.g. it was
|
||||||
|
"(exported)" when unlocked but is now in active pool "tank", or it
|
||||||
|
used to be a cache vdev and now shows as data — the grant is
|
||||||
|
invalidated. Otherwise the operator's "I confirm this exported drive
|
||||||
|
is decommissioned" judgement would silently authorise destruction
|
||||||
|
of a live pool.
|
||||||
|
"""
|
||||||
|
expiry: float
|
||||||
|
pool_name: str
|
||||||
|
pool_role: str | None
|
||||||
|
|
||||||
|
|
||||||
|
_unlock_grants: dict[int, _UnlockGrant] = {}
|
||||||
|
|
||||||
|
|
||||||
|
class PoolMemberError(Exception):
|
||||||
|
"""Raised by start_job when a drive is in a zpool and not unlocked."""
|
||||||
|
def __init__(self, drive_id: int, pool_name: str, pool_role: str | None):
|
||||||
|
self.drive_id = drive_id
|
||||||
|
self.pool_name = pool_name
|
||||||
|
self.pool_role = pool_role
|
||||||
|
is_boot = pool_name == BOOT_POOL_NAME
|
||||||
|
super().__init__(
|
||||||
|
f"Drive is part of {'BOOT POOL' if is_boot else 'pool'} "
|
||||||
|
f"'{pool_name}'{' (' + pool_role + ')' if pool_role else ''}. "
|
||||||
|
f"Unlock required before burn-in."
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def _is_unlocked(drive_id: int, current_pool_name: str | None,
|
||||||
|
current_pool_role: str | None) -> bool:
|
||||||
|
"""True iff a non-expired grant exists AND the drive's pool identity
|
||||||
|
matches what was observed at unlock time."""
|
||||||
|
grant = _unlock_grants.get(drive_id)
|
||||||
|
if grant is None:
|
||||||
|
return False
|
||||||
|
if _time.time() >= grant.expiry:
|
||||||
|
_unlock_grants.pop(drive_id, None)
|
||||||
|
return False
|
||||||
|
if grant.pool_name != current_pool_name or grant.pool_role != current_pool_role:
|
||||||
|
# Pool identity changed since unlock — drive may now belong to a
|
||||||
|
# different (or live) pool. Invalidate the grant; operator must
|
||||||
|
# re-unlock with eyes-open against the current state.
|
||||||
|
_unlock_grants.pop(drive_id, None)
|
||||||
|
log.warning(
|
||||||
|
"Invalidating unlock grant for drive_id=%d: pool changed from "
|
||||||
|
"(%s, %s) to (%s, %s)",
|
||||||
|
drive_id, grant.pool_name, grant.pool_role,
|
||||||
|
current_pool_name, current_pool_role,
|
||||||
|
)
|
||||||
|
return False
|
||||||
|
return True
|
||||||
|
|
||||||
|
|
||||||
|
def unlock_expiry(drive_id: int, current_pool_name: str | None,
|
||||||
|
current_pool_role: str | None) -> float | None:
|
||||||
|
"""Return the absolute expiry of an active grant, or None.
|
||||||
|
|
||||||
|
Same identity-binding semantics as _is_unlocked: a grant whose stored
|
||||||
|
pool identity no longer matches the current row is treated as expired
|
||||||
|
and reaped. This is what the dashboard reads to decide whether to show
|
||||||
|
the unlocked-Burn-In affordance vs the locked-Unlock affordance.
|
||||||
|
"""
|
||||||
|
grant = _unlock_grants.get(drive_id)
|
||||||
|
if grant is None:
|
||||||
|
return None
|
||||||
|
if _time.time() >= grant.expiry:
|
||||||
|
_unlock_grants.pop(drive_id, None)
|
||||||
|
return None
|
||||||
|
if grant.pool_name != current_pool_name or grant.pool_role != current_pool_role:
|
||||||
|
_unlock_grants.pop(drive_id, None)
|
||||||
|
return None
|
||||||
|
return grant.expiry
|
||||||
|
|
||||||
|
|
||||||
|
async def grant_pool_unlock(drive_id: int, confirm_token: str,
|
||||||
|
operator: str, reason: str) -> float:
|
||||||
|
"""Validate confirmation token + reason and grant a time-limited unlock.
|
||||||
|
|
||||||
|
Raises ValueError on bad confirm_token, missing reason, or drive not
|
||||||
|
actually in a pool. Returns the unix expiry timestamp on success.
|
||||||
|
"""
|
||||||
|
if not reason or len(reason.strip()) < 5:
|
||||||
|
raise ValueError("A reason of at least 5 characters is required.")
|
||||||
|
if not operator or not operator.strip():
|
||||||
|
raise ValueError("Operator name is required.")
|
||||||
|
|
||||||
|
async with _db() as db:
|
||||||
|
db.row_factory = aiosqlite.Row
|
||||||
|
cur = await db.execute(
|
||||||
|
"SELECT pool_name, pool_role, devname FROM drives WHERE id=?",
|
||||||
|
(drive_id,),
|
||||||
|
)
|
||||||
|
row = await cur.fetchone()
|
||||||
|
if not row:
|
||||||
|
raise ValueError("Drive not found.")
|
||||||
|
pool_name = row["pool_name"]
|
||||||
|
pool_role = row["pool_role"]
|
||||||
|
if not pool_name:
|
||||||
|
raise ValueError(
|
||||||
|
"This drive is not part of any pool — no unlock needed."
|
||||||
|
)
|
||||||
|
|
||||||
|
# Boot-pool and exported pools both get dedicated, harder-to-fat-
|
||||||
|
# finger tokens. Active data pools just need their pool name typed.
|
||||||
|
if pool_name == BOOT_POOL_NAME:
|
||||||
|
expected = BOOT_POOL_CONFIRM_TOKEN
|
||||||
|
elif pool_role == EXPORTED_POOL_ROLE:
|
||||||
|
expected = EXPORTED_CONFIRM_TOKEN
|
||||||
|
else:
|
||||||
|
expected = pool_name
|
||||||
|
if (confirm_token or "").strip() != expected:
|
||||||
|
raise ValueError("Confirmation token does not match.")
|
||||||
|
|
||||||
|
if pool_name == BOOT_POOL_NAME:
|
||||||
|
evt = "boot_pool_drive_unlocked"
|
||||||
|
elif pool_role == EXPORTED_POOL_ROLE:
|
||||||
|
evt = "exported_pool_drive_unlocked"
|
||||||
|
else:
|
||||||
|
evt = "pool_drive_unlocked"
|
||||||
|
await db.execute(
|
||||||
|
"""INSERT INTO audit_events
|
||||||
|
(event_type, drive_id, burnin_job_id, operator, message)
|
||||||
|
VALUES (?,?,?,?,?)""",
|
||||||
|
(evt, drive_id, None, operator.strip(),
|
||||||
|
f"Unlocked {pool_name} drive {row['devname']} for burn-in: {reason.strip()}"),
|
||||||
|
)
|
||||||
|
await db.commit()
|
||||||
|
|
||||||
|
# Arm the in-memory grant ONLY after the audit row is durable. If the
|
||||||
|
# commit above raises, we exit without writing _unlock_grants — no
|
||||||
|
# unaudited active unlocks. The grant is bound to the (pool_name,
|
||||||
|
# pool_role) we observed under the open transaction so a later poll
|
||||||
|
# that reclassifies the drive invalidates it (see _is_unlocked).
|
||||||
|
expiry = _time.time() + UNLOCK_TTL_SECONDS
|
||||||
|
_unlock_grants[drive_id] = _UnlockGrant(
|
||||||
|
expiry=expiry,
|
||||||
|
pool_name=pool_name,
|
||||||
|
pool_role=pool_role,
|
||||||
|
)
|
||||||
|
|
||||||
|
log.warning(
|
||||||
|
"Pool-drive unlock granted: drive_id=%d pool=%s role=%s "
|
||||||
|
"operator=%s reason=%r",
|
||||||
|
drive_id, pool_name, pool_role, operator, reason,
|
||||||
|
)
|
||||||
|
return expiry
|
||||||
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
# ---------------------------------------------------------------------------
|
||||||
# Public API
|
# Public API
|
||||||
# ---------------------------------------------------------------------------
|
# ---------------------------------------------------------------------------
|
||||||
|
|
@ -142,13 +375,35 @@ async def start_job(drive_id: int, profile: str, operator: str,
|
||||||
if (await cur.fetchone())[0] > 0:
|
if (await cur.fetchone())[0] > 0:
|
||||||
raise ValueError("Drive already has an active burn-in job")
|
raise ValueError("Drive already has an active burn-in job")
|
||||||
|
|
||||||
# Create job
|
# Pool-membership gate: locked unless the operator explicitly
|
||||||
|
# unlocked this drive via /api/v1/drives/{id}/unlock recently.
|
||||||
|
# _is_unlocked also checks that the grant's stored (pool_name,
|
||||||
|
# pool_role) still matches the live row — a grant issued for an
|
||||||
|
# exported drive doesn't carry over if the drive turns out to be
|
||||||
|
# in an active pool on the next poll.
|
||||||
|
cur = await db.execute(
|
||||||
|
"SELECT pool_name, pool_role FROM drives WHERE id=?", (drive_id,)
|
||||||
|
)
|
||||||
|
drow = await cur.fetchone()
|
||||||
|
if drow and drow["pool_name"] and not _is_unlocked(
|
||||||
|
drive_id, drow["pool_name"], drow["pool_role"]
|
||||||
|
):
|
||||||
|
raise PoolMemberError(drive_id, drow["pool_name"], drow["pool_role"])
|
||||||
|
|
||||||
|
# Create job. The partial unique index uniq_active_burnin_per_drive
|
||||||
|
# (database.py) is the actual race-stopper here: if two concurrent
|
||||||
|
# /api/v1/burnin/start calls both pass the SELECT-COUNT check above,
|
||||||
|
# only one INSERT can win; the loser raises IntegrityError, which
|
||||||
|
# we surface with the same ValueError as the inline duplicate check.
|
||||||
|
try:
|
||||||
cur = await db.execute(
|
cur = await db.execute(
|
||||||
"""INSERT INTO burnin_jobs (drive_id, profile, state, percent, operator, created_at)
|
"""INSERT INTO burnin_jobs (drive_id, profile, state, percent, operator, created_at)
|
||||||
VALUES (?,?,?,?,?,?) RETURNING id""",
|
VALUES (?,?,?,?,?,?) RETURNING id""",
|
||||||
(drive_id, profile, "queued", 0, operator, now),
|
(drive_id, profile, "queued", 0, operator, now),
|
||||||
)
|
)
|
||||||
job_id = (await cur.fetchone())["id"]
|
job_id = (await cur.fetchone())["id"]
|
||||||
|
except aiosqlite.IntegrityError:
|
||||||
|
raise ValueError("Drive already has an active burn-in job")
|
||||||
|
|
||||||
# Create stage rows in the desired execution order
|
# Create stage rows in the desired execution order
|
||||||
for stage_name in stages:
|
for stage_name in stages:
|
@@ -164,7 +419,7 @@ async def start_job(drive_id: int, profile: str, operator: str,
             )
         await db.commit()
 
-    asyncio.create_task(_run_job(job_id))
+    _spawn_run_job(job_id)
     log.info("Burn-in job %d queued (drive_id=%d profile=%s operator=%s)",
              job_id, drive_id, profile, operator)
     return job_id
@@ -198,6 +453,13 @@ async def cancel_job(job_id: int, operator: str) -> bool:
         )
         await db.commit()
 
+    # Kill the remote child process FIRST (so proc.wait() in the running task
+    # can return), then cancel the task so any other awaits unblock.
+    await _kill_remote_process(job_id)
+    task = _active_tasks.get(job_id)
+    if task and not task.done():
+        task.cancel()
+
     log.info("Burn-in job %d cancelled by %s", job_id, operator)
     return True
 
||||||
|
|
@ -206,10 +468,45 @@ async def cancel_job(job_id: int, operator: str) -> bool:
|
||||||
# Job runner
|
# Job runner
|
||||||
# ---------------------------------------------------------------------------
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
async def _thermal_gate_ok() -> bool:
|
||||||
|
"""True if it's thermally safe to start a new burn-in.
|
||||||
|
Checks the peak temperature of drives currently under active burn-in.
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
async with _db() as db:
|
||||||
|
cur = await db.execute("""
|
||||||
|
SELECT MAX(d.temperature_c)
|
||||||
|
FROM drives d
|
||||||
|
JOIN burnin_jobs bj ON bj.drive_id = d.id
|
||||||
|
WHERE bj.state = 'running' AND d.temperature_c IS NOT NULL
|
||||||
|
""")
|
||||||
|
row = await cur.fetchone()
|
||||||
|
max_temp = row[0] if row and row[0] is not None else None
|
||||||
|
return max_temp is None or max_temp < settings.temp_warn_c
|
||||||
|
except Exception:
|
||||||
|
return True # Never block on error
|
||||||
|
|
||||||
|
|
||||||
async def _run_job(job_id: int) -> None:
|
async def _run_job(job_id: int) -> None:
|
||||||
"""Acquire semaphore slot, execute all stages, persist final state."""
|
"""Acquire semaphore slot, execute all stages, persist final state."""
|
||||||
assert _semaphore is not None, "burnin.init() not called"
|
assert _semaphore is not None, "burnin.init() not called"
|
||||||
|
|
||||||
|
# Adaptive thermal gate: wait before competing for a slot if running drives
|
||||||
|
# are already at or above the warning threshold. This prevents layering a
|
||||||
|
# new burn-in on top of a thermally-stressed system. Gives up after 3 min
|
||||||
|
# and proceeds anyway so jobs don't queue indefinitely.
|
||||||
|
for _attempt in range(18): # 18 × 10 s = 3 min max
|
||||||
|
if await _thermal_gate_ok():
|
||||||
|
break
|
||||||
|
if _attempt == 0:
|
||||||
|
log.info(
|
||||||
|
"Thermal gate: job %d waiting — running drive temps at or above %d°C",
|
||||||
|
job_id, settings.temp_warn_c,
|
||||||
|
)
|
||||||
|
await asyncio.sleep(10)
|
||||||
|
else:
|
||||||
|
log.warning("Thermal gate timed out for job %d — proceeding anyway", job_id)
|
||||||
|
|
||||||
async with _semaphore:
|
async with _semaphore:
|
||||||
if await _is_cancelled(job_id):
|
if await _is_cancelled(job_id):
|
||||||
return
|
return
|
||||||
|
|
@ -254,17 +551,34 @@ async def _run_job(job_id: int) -> None:
|
||||||
|
|
||||||
success = False
|
success = False
|
||||||
error_text = None
|
error_text = None
|
||||||
|
was_cancelled = False
|
||||||
try:
|
try:
|
||||||
success = await _execute_stages(job_id, job_stages, devname, drive_id)
|
success = await _execute_stages(job_id, job_stages, devname, drive_id)
|
||||||
except asyncio.CancelledError:
|
except asyncio.CancelledError:
|
||||||
pass
|
was_cancelled = True
|
||||||
except Exception as exc:
|
except Exception as exc:
|
||||||
error_text = str(exc)
|
error_text = str(exc)
|
||||||
log.exception("Burn-in raised exception", extra={"job_id": job_id, "devname": devname})
|
log.exception("Burn-in raised exception", extra={"job_id": job_id, "devname": devname})
|
||||||
|
|
||||||
if await _is_cancelled(job_id):
|
# If the job has already moved to a terminal state — by cancel_job
|
||||||
|
# ('cancelled') or check_stuck_jobs ('unknown') — leave it alone. The
|
||||||
|
# task may have been cancelled mid-stage; finalizing as 'failed' would
|
||||||
|
# clobber that audit-meaningful terminal state.
|
||||||
|
async with _db() as db:
|
||||||
|
cur = await db.execute("SELECT state FROM burnin_jobs WHERE id=?", (job_id,))
|
||||||
|
cur_row = await cur.fetchone()
|
||||||
|
if cur_row and cur_row[0] != "running":
|
||||||
return
|
return
|
||||||
|
|
||||||
|
# Cancellation arriving here means the asyncio task was cancelled
|
||||||
|
# by something other than cancel_job/check_stuck_jobs (shutdown,
|
||||||
|
# uvicorn reload, future code paths). The DB still says 'running',
|
||||||
|
# so we have to write *some* terminal state, but classifying the
|
||||||
|
# interrupted job as 'failed' would lie — we don't actually know
|
||||||
|
# whether the underlying SMART/badblocks work passed or not.
|
||||||
|
if was_cancelled:
|
||||||
|
final_state = "unknown"
|
||||||
|
else:
|
||||||
final_state = "passed" if success else "failed"
|
final_state = "passed" if success else "failed"
|
||||||
async with _db() as db:
|
async with _db() as db:
|
||||||
await db.execute("PRAGMA journal_mode=WAL")
|
await db.execute("PRAGMA journal_mode=WAL")
|
||||||
|
|
@ -464,6 +778,14 @@ async def _stage_smart_test_ssh(job_id: int, devname: str, test_type: str, stage
|
||||||
# Brief pause to let the test register in smartctl output
|
# Brief pause to let the test register in smartctl output
|
||||||
await asyncio.sleep(3)
|
await asyncio.sleep(3)
|
||||||
|
|
||||||
|
# Throttle log_text appends — every poll on a multi-hour long_smart bloated
|
||||||
|
# log_text to 50+ MB and triggered SQLite "database is locked" because each
|
||||||
|
# COALESCE-then-append rewrites the whole column. Append every ~60s, on the
|
||||||
|
# first poll, and on any state change.
|
||||||
|
LOG_EVERY_N_POLLS = 12
|
||||||
|
poll_count = 0
|
||||||
|
last_state: str | None = None
|
||||||
|
|
||||||
# Poll until complete
|
# Poll until complete
|
||||||
while True:
|
while True:
|
||||||
if await _is_cancelled(job_id):
|
if await _is_cancelled(job_id):
|
||||||
|
|
@ -482,6 +804,10 @@ async def _stage_smart_test_ssh(job_id: int, devname: str, test_type: str, stage
|
||||||
await _append_stage_log(job_id, stage_name, f"[poll error] {exc}\n")
|
await _append_stage_log(job_id, stage_name, f"[poll error] {exc}\n")
|
||||||
continue
|
continue
|
||||||
|
|
||||||
|
poll_count += 1
|
||||||
|
state_changed = progress["state"] != last_state
|
||||||
|
last_state = progress["state"]
|
||||||
|
if poll_count == 1 or poll_count % LOG_EVERY_N_POLLS == 0 or state_changed:
|
||||||
await _append_stage_log(job_id, stage_name, progress["output"] + "\n---\n")
|
await _append_stage_log(job_id, stage_name, progress["output"] + "\n---\n")
|
||||||
|
|
||||||
if progress["state"] == "running":
|
if progress["state"] == "running":
|
||||||
|
|
@ -519,15 +845,39 @@ async def _stage_smart_test_ssh(job_id: int, devname: str, test_type: str, stage
|
||||||
# "unknown" → keep polling
|
# "unknown" → keep polling
|
||||||
|
|
||||||
|
|
||||||
|
async def _badblocks_available() -> bool:
|
||||||
|
"""Check if badblocks is installed on the remote host (Linux/SCALE only)."""
|
||||||
|
from app import ssh_client
|
||||||
|
try:
|
||||||
|
async with await ssh_client._connect() as conn:
|
||||||
|
result = await conn.run("which badblocks", check=False)
|
||||||
|
return result.returncode == 0
|
||||||
|
except Exception:
|
||||||
|
return False
|
||||||
|
|
||||||
|
|
||||||
async def _stage_surface_validate(job_id: int, devname: str, drive_id: int) -> bool:
|
async def _stage_surface_validate(job_id: int, devname: str, drive_id: int) -> bool:
|
||||||
"""
|
"""
|
||||||
Surface validation stage.
|
Surface validation stage — auto-routes to the right implementation:
|
||||||
SSH mode: runs badblocks -wsv -b 4096 -p 1 /dev/{devname}.
|
|
||||||
Mock mode: simulated timed progress (no real I/O).
|
1. SSH configured + badblocks available (TrueNAS SCALE / Linux):
|
||||||
|
→ runs badblocks -wsv -b 4096 -p 1 /dev/{devname} directly over SSH.
|
||||||
|
2. SSH configured + badblocks NOT available (TrueNAS CORE / FreeBSD):
|
||||||
|
→ uses TrueNAS REST API disk.wipe FULL job + post-wipe SMART check.
|
||||||
|
3. No SSH:
|
||||||
|
→ simulated timed progress (dev/mock mode).
|
||||||
"""
|
"""
|
||||||
from app import ssh_client
|
from app import ssh_client
|
||||||
if ssh_client.is_configured():
|
if ssh_client.is_configured():
|
||||||
|
if await _badblocks_available():
|
||||||
return await _stage_surface_validate_ssh(job_id, devname, drive_id)
|
return await _stage_surface_validate_ssh(job_id, devname, drive_id)
|
||||||
|
# TrueNAS CORE/FreeBSD: badblocks not available — use native wipe API
|
||||||
|
await _append_stage_log(
|
||||||
|
job_id, "surface_validate",
|
||||||
|
"[INFO] badblocks not found on host (TrueNAS CORE/FreeBSD) — "
|
||||||
|
"using TrueNAS disk.wipe API (FULL write pass).\n\n"
|
||||||
|
)
|
||||||
|
return await _stage_surface_validate_truenas(job_id, devname, drive_id)
|
||||||
return await _stage_timed_simulate(job_id, "surface_validate", settings.surface_validate_seconds)
|
return await _stage_timed_simulate(job_id, "surface_validate", settings.surface_validate_seconds)
|
||||||
|
|
||||||
|
|
||||||
|
|
@ -537,8 +887,11 @@ async def _stage_surface_validate_ssh(job_id: int, devname: str, drive_id: int)
|
||||||
|
|
||||||
await _append_stage_log(
|
await _append_stage_log(
|
||||||
job_id, "surface_validate",
|
job_id, "surface_validate",
|
||||||
f"[START] badblocks -wsv -b 4096 -p 1 /dev/{devname}\n"
|
f"[START] badblocks -wsv -b {settings.surface_validate_block_size} "
|
||||||
f"[NOTE] This is a DESTRUCTIVE write test. All data on /dev/{devname} will be overwritten.\n\n"
|
f"-c {settings.surface_validate_block_buffer} "
|
||||||
|
f"-p {settings.surface_validate_passes} /dev/{devname}\n"
|
||||||
|
f"[NOTE] This is a DESTRUCTIVE write test. "
|
||||||
|
f"All data on /dev/{devname} will be overwritten.\n\n"
|
||||||
)
|
)
|
||||||
|
|
||||||
def _is_cancelled_sync() -> bool:
|
def _is_cancelled_sync() -> bool:
|
||||||
|
|
@ -580,14 +933,50 @@ async def _stage_surface_validate_ssh(job_id: int, devname: str, drive_id: int)
|
||||||
output_lines: list[str] = []
|
output_lines: list[str] = []
|
||||||
|
|
||||||
async with await ssh_client._connect() as conn:
|
async with await ssh_client._connect() as conn:
|
||||||
cmd = f"badblocks -wsv -b 4096 -p 1 /dev/{devname}"
|
# Wrap in `sh -c 'echo PID:$$; exec ...'` so we get the remote
|
||||||
|
# PID on the first stdout line. asyncssh's proc.kill() sends an
|
||||||
|
# SSH signal request that OpenSSH's sshd ignores by default, so
|
||||||
|
# we need the PID to issue an out-of-band `kill -9` over a fresh
|
||||||
|
# session when we want to abort.
|
||||||
|
#
|
||||||
|
# Block geometry is operator-tunable (Settings → Burn-in):
|
||||||
|
# -b N block size in bytes (settings.surface_validate_block_size)
|
||||||
|
# -c N blocks held per IO (settings.surface_validate_block_buffer)
|
||||||
|
# -p N pass count (settings.surface_validate_passes)
|
||||||
|
# Defaults preserve original behavior (-b 4096 -c 64 -p 1).
|
||||||
|
bb_args = (
|
||||||
|
f"-wsv "
|
||||||
|
f"-b {settings.surface_validate_block_size} "
|
||||||
|
f"-c {settings.surface_validate_block_buffer} "
|
||||||
|
f"-p {settings.surface_validate_passes}"
|
||||||
|
)
|
||||||
|
cmd = (
|
||||||
|
f"sh -c 'echo PID:$$; exec badblocks {bb_args} /dev/{devname}'"
|
||||||
|
)
|
||||||
async with conn.create_process(cmd) as proc:
|
async with conn.create_process(cmd) as proc:
|
||||||
import re as _re
|
import re as _re
|
||||||
|
|
||||||
|
pid_seen = False
|
||||||
|
|
||||||
async def _drain(stream, is_stderr: bool):
|
async def _drain(stream, is_stderr: bool):
|
||||||
nonlocal bad_blocks_total
|
nonlocal bad_blocks_total, pid_seen
|
||||||
async for raw in stream:
|
async for raw in stream:
|
||||||
line = raw if isinstance(raw, str) else raw.decode("utf-8", errors="replace")
|
line = raw if isinstance(raw, str) else raw.decode("utf-8", errors="replace")
|
||||||
|
|
||||||
|
# First stdout line is "PID:<n>" from the wrapping shell.
|
||||||
|
# Capture it and don't append it to the user-visible log.
|
||||||
|
if not is_stderr and not pid_seen and line.startswith("PID:"):
|
||||||
|
pid_seen = True
|
||||||
|
try:
|
||||||
|
_remote_pids[job_id] = int(line[4:].strip())
|
||||||
|
log.info(
|
||||||
|
"Captured remote PID %d for job %d (badblocks)",
|
||||||
|
_remote_pids[job_id], job_id,
|
||||||
|
)
|
||||||
|
except ValueError:
|
||||||
|
pass
|
||||||
|
continue
|
||||||
|
|
||||||
output_lines.append(line)
|
output_lines.append(line)
|
||||||
|
|
||||||
if is_stderr:
|
if is_stderr:
|
||||||
|
|
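Reviewer note: the PID-capture wrapper in this hunk can be exercised without an SSH session. This is a minimal sketch, assuming the command is assembled from an argument list. `wrap_with_pid` and `parse_pid_line` are hypothetical helpers, and the `shlex.quote` hardening is an addition that the PR itself does not make (it interpolates `devname` directly).

```python
import shlex
from typing import List, Optional


def wrap_with_pid(argv: List[str]) -> str:
    """Build `sh -c 'echo PID:$$; exec <cmd>'` with every argument shell-quoted."""
    inner = "echo PID:$$; exec " + " ".join(shlex.quote(a) for a in argv)
    # `exec` replaces the shell, so the echoed PID is the PID of the real command.
    return "sh -c " + shlex.quote(inner)


def parse_pid_line(line: str) -> Optional[int]:
    """Return the PID from a 'PID:<n>' line, or None for any other line."""
    if not line.startswith("PID:"):
        return None
    try:
        return int(line[4:].strip())
    except ValueError:
        return None
```

For example, `wrap_with_pid(["badblocks", "-wsv", "/dev/sda"])` yields a single string suitable for a remote exec call, and the first stdout line it produces can be fed to `parse_pid_line`.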
@ -610,7 +999,7 @@ async def _stage_surface_validate_ssh(job_id: int, devname: str, drive_id: int)
|
||||||
|
|
||||||
# Abort on bad block threshold
|
# Abort on bad block threshold
|
||||||
if bad_blocks_total > settings.bad_block_threshold:
|
if bad_blocks_total > settings.bad_block_threshold:
|
||||||
proc.kill()
|
await _kill_remote_process(job_id)
|
||||||
output_lines.append(
|
output_lines.append(
|
||||||
f"\n[ABORTED] {bad_blocks_total} bad block(s) exceeded "
|
f"\n[ABORTED] {bad_blocks_total} bad block(s) exceeded "
|
||||||
f"threshold ({settings.bad_block_threshold})\n"
|
f"threshold ({settings.bad_block_threshold})\n"
|
||||||
|
|
@ -618,7 +1007,7 @@ async def _stage_surface_validate_ssh(job_id: int, devname: str, drive_id: int)
|
||||||
return
|
return
|
||||||
|
|
||||||
if await _is_cancelled(job_id):
|
if await _is_cancelled(job_id):
|
||||||
proc.kill()
|
await _kill_remote_process(job_id)
|
||||||
return
|
return
|
||||||
|
|
||||||
await asyncio.gather(
|
await asyncio.gather(
|
||||||
|
|
@ -626,7 +1015,17 @@ async def _stage_surface_validate_ssh(job_id: int, devname: str, drive_id: int)
|
||||||
_drain(proc.stderr, True),
|
_drain(proc.stderr, True),
|
||||||
return_exceptions=True,
|
return_exceptions=True,
|
||||||
)
|
)
|
||||||
await proc.wait()
|
# Bound proc.wait so a remote process that ignored our kill
|
||||||
|
# signal (or that we never managed to kill) can't pin this
|
||||||
|
# task in the semaphore forever. Closing the connection on
|
||||||
|
# exit will deliver SIGPIPE to the remote on its next write.
|
||||||
|
try:
|
||||||
|
await asyncio.wait_for(proc.wait(), timeout=15)
|
||||||
|
except asyncio.TimeoutError:
|
||||||
|
log.warning(
|
||||||
|
"proc.wait() timed out for job %d — abandoning channel",
|
||||||
|
job_id,
|
||||||
|
)
|
||||||
|
|
||||||
# Flush remaining output
|
# Flush remaining output
|
||||||
remainder = "".join(output_lines)
|
remainder = "".join(output_lines)
|
||||||
|
|
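Reviewer note: the bounded `proc.wait()` in this hunk follows the usual `asyncio.wait_for` pattern. The sketch below reproduces it with stand-ins instead of an SSH process; `wait_bounded` is a hypothetical helper and the 0.05 s timeout is chosen only to keep the demo fast (the PR uses 15 s).

```python
import asyncio


async def wait_bounded(awaitable, timeout: float) -> bool:
    """Return True if `awaitable` finished in time, False if we gave up on it."""
    try:
        await asyncio.wait_for(awaitable, timeout=timeout)
        return True
    except asyncio.TimeoutError:
        # wait_for has already cancelled the inner awaitable at this point.
        return False


async def _demo():
    never = asyncio.Event()      # never set: stands in for a hung proc.wait()
    quick = asyncio.sleep(0)     # stands in for a process that already exited
    hung = await wait_bounded(never.wait(), 0.05)
    done = await wait_bounded(quick, 0.05)
    return hung, done


hung_finished, quick_finished = asyncio.run(_demo())
```

The caller keeps control either way, which is what lets the surrounding task release its semaphore slot.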
@ -655,6 +1054,116 @@ async def _stage_surface_validate_ssh(job_id: int, devname: str, drive_id: int)
|
||||||
return True
|
return True
|
||||||
|
|
||||||
|
|
async def _stage_surface_validate_truenas(job_id: int, devname: str, drive_id: int) -> bool:
    """
    Surface validation via TrueNAS CORE disk.wipe REST API.
    Used on FreeBSD (TrueNAS CORE) where badblocks is unavailable.

    Sends a FULL write-zero pass across the entire disk, polls progress,
    then runs a post-wipe SMART attribute check to catch reallocated sectors.
    """
    from app import ssh_client

    await _append_stage_log(
        job_id, "surface_validate",
        f"[START] TrueNAS disk.wipe FULL — {devname}\n"
        f"[NOTE] DESTRUCTIVE: all data on {devname} will be overwritten.\n\n"
    )

    # Start the wipe job
    try:
        tn_job_id = await _client.wipe_disk(devname, "FULL")
    except Exception as exc:
        await _set_stage_error(job_id, "surface_validate", f"Failed to start disk.wipe: {exc}")
        return False

    await _append_stage_log(
        job_id, "surface_validate",
        f"[JOB] TrueNAS wipe job started (job_id={tn_job_id})\n"
    )

    # Poll until complete
    log_flush_counter = 0
    while True:
        if await _is_cancelled(job_id):
            try:
                await _client.abort_job(tn_job_id)
            except Exception:
                pass
            return False

        await asyncio.sleep(POLL_INTERVAL)

        try:
            job = await _client.get_job(tn_job_id)
        except Exception as exc:
            log.warning("Wipe job poll failed: %s", exc, extra={"job_id": job_id})
            await _append_stage_log(job_id, "surface_validate", f"[poll error] {exc}\n")
            continue

        if not job:
            await _set_stage_error(job_id, "surface_validate", f"Wipe job {tn_job_id} not found")
            return False

        state = job.get("state", "")
        pct = int(job.get("progress", {}).get("percent", 0) or 0)
        desc = job.get("progress", {}).get("description", "")

        await _update_stage_percent(job_id, "surface_validate", min(pct, 99))
        await _recalculate_progress(job_id)
        _push_update()

        # Log progress description every ~5 polls to avoid DB spam
        log_flush_counter += 1
        if desc and log_flush_counter % 5 == 0:
            await _append_stage_log(job_id, "surface_validate", f"[{pct}%] {desc}\n")

        if state == "SUCCESS":
            await _update_stage_percent(job_id, "surface_validate", 100)
            await _append_stage_log(
                job_id, "surface_validate",
                f"\n[DONE] Wipe job {tn_job_id} completed successfully.\n"
            )
            # Post-wipe SMART check — catch any sectors that failed under write stress
            if ssh_client.is_configured() and drive_id is not None:
                await _append_stage_log(
                    job_id, "surface_validate",
                    "[CHECK] Running post-wipe SMART attribute check...\n"
                )
                try:
                    attrs = await ssh_client.get_smart_attributes(devname)
                    await _store_smart_attrs(drive_id, attrs)
                    if attrs["failures"]:
                        error = "Post-wipe SMART check: " + "; ".join(attrs["failures"])
                        await _set_stage_error(job_id, "surface_validate", error)
                        return False
                    if attrs["warnings"]:
                        await _append_stage_log(
                            job_id, "surface_validate",
                            "[WARNING] " + "; ".join(attrs["warnings"]) + "\n"
                        )
                    await _append_stage_log(
                        job_id, "surface_validate",
                        f"[CHECK] SMART health: {attrs['health']} — no critical attributes.\n"
                    )
                except Exception as exc:
                    log.warning("Post-wipe SMART check failed: %s", exc)
                    await _append_stage_log(
                        job_id, "surface_validate",
                        f"[WARN] Post-wipe SMART check failed (non-fatal): {exc}\n"
                    )
            return True

        elif state in ("FAILED", "ABORTED", "ERROR"):
            error_msg = job.get("error") or f"Disk wipe failed (state={state})"
            await _set_stage_error(
                job_id, "surface_validate",
                f"TrueNAS disk.wipe FAILED: {error_msg}"
            )
            return False
        # RUNNING or WAITING — keep polling
||||||
|
|
||||||
|
|
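Reviewer note: the polling loop in `_stage_surface_validate_truenas` is easier to reason about when run against canned responses. The sketch below is a simplified version, not the PR's code: `FakeClient` and `poll_until_done` are hypothetical, cancellation and logging are omitted, and the `or {}` guard against a null `progress` field is an added assumption.

```python
import asyncio

TERMINAL_FAILURES = ("FAILED", "ABORTED", "ERROR")


class FakeClient:
    """Hypothetical stand-in for the TrueNAS client: replays canned job snapshots."""

    def __init__(self, snapshots):
        self._snapshots = iter(snapshots)

    async def get_job(self, job_id):
        return next(self._snapshots)


async def poll_until_done(client, job_id, interval=0.0):
    """Poll until SUCCESS or a failure state; return (ok, reported_percents)."""
    seen = []
    while True:
        await asyncio.sleep(interval)
        job = await client.get_job(job_id)
        if not job:
            return False, seen
        pct = int((job.get("progress") or {}).get("percent", 0) or 0)
        state = job.get("state", "")
        if state == "SUCCESS":
            seen.append(100)
            return True, seen
        if state in TERMINAL_FAILURES:
            return False, seen
        seen.append(min(pct, 99))  # hold at 99 until the job confirms success


ok, percents = asyncio.run(poll_until_done(
    FakeClient([
        {"state": "RUNNING", "progress": {"percent": 40}},
        {"state": "RUNNING", "progress": {"percent": 100}},
        {"state": "SUCCESS", "progress": {"percent": 100}},
    ]),
    job_id=1,
))
```

Capping the reported value at 99 while the job is still `RUNNING` keeps the UI from showing a finished stage before the post-wipe checks have had a chance to fail it.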
||||||
async def _stage_timed_simulate(job_id: int, stage_name: str, duration_seconds: int) -> bool:
|
async def _stage_timed_simulate(job_id: int, stage_name: str, duration_seconds: int) -> bool:
|
||||||
"""Simulate a timed stage with progress updates (mock / dev mode)."""
|
"""Simulate a timed stage with progress updates (mock / dev mode)."""
|
||||||
start = time.monotonic()
|
start = time.monotonic()
|
||||||
|
|
@ -681,21 +1190,47 @@ async def _stage_final_check(job_id: int, devname: str, drive_id: int | None = N
|
||||||
Verify drive passed all tests.
|
Verify drive passed all tests.
|
||||||
SSH mode: run smartctl -a and check critical attributes.
|
SSH mode: run smartctl -a and check critical attributes.
|
||||||
Mock mode: check SMART health field in DB.
|
Mock mode: check SMART health field in DB.
|
||||||
|
|
||||||
|
A transient SSH connectivity failure here must NOT invalidate a prior
|
||||||
|
multi-day surface_validate. Retry SSH-only failures, then soft-pass.
|
||||||
"""
|
"""
|
||||||
await asyncio.sleep(1)
|
await asyncio.sleep(1)
|
||||||
from app import ssh_client
|
from app import ssh_client
|
||||||
|
|
||||||
|
def _ssh_only(failures: list[str]) -> bool:
|
||||||
|
return bool(failures) and all(f.startswith("SSH error:") for f in failures)
|
||||||
|
|
||||||
if ssh_client.is_configured() and drive_id is not None:
|
if ssh_client.is_configured() and drive_id is not None:
|
||||||
try:
|
try:
|
||||||
attrs = await ssh_client.get_smart_attributes(devname)
|
attrs = await ssh_client.get_smart_attributes(devname)
|
||||||
|
for attempt in range(2):
|
||||||
|
if not _ssh_only(attrs.get("failures") or []):
|
||||||
|
break
|
||||||
|
log.warning(
|
||||||
|
"final_check SSH unreachable (attempt %d/3); retrying in 30s",
|
||||||
|
attempt + 1,
|
||||||
|
extra={"job_id": job_id, "devname": devname},
|
||||||
|
)
|
||||||
|
await asyncio.sleep(30)
|
||||||
|
attrs = await ssh_client.get_smart_attributes(devname)
|
||||||
|
|
||||||
|
failures = attrs.get("failures") or []
|
||||||
|
if _ssh_only(failures):
|
||||||
|
log.warning(
|
||||||
|
"final_check soft-pass: SSH unreachable after retries; prior stages stand",
|
||||||
|
extra={"job_id": job_id, "devname": devname, "ssh_error": failures},
|
||||||
|
)
|
||||||
|
return True
|
||||||
|
|
||||||
await _store_smart_attrs(drive_id, attrs)
|
await _store_smart_attrs(drive_id, attrs)
|
||||||
if attrs["health"] == "FAILED" or attrs["failures"]:
|
if attrs["health"] == "FAILED" or failures:
|
||||||
failures = attrs["failures"] or [f"SMART health: {attrs['health']}"]
|
msg = failures or [f"SMART health: {attrs['health']}"]
|
||||||
await _set_stage_error(job_id, "final_check",
|
await _set_stage_error(job_id, "final_check",
|
||||||
"Final check failed: " + "; ".join(failures))
|
"Final check failed: " + "; ".join(msg))
|
||||||
return False
|
return False
|
||||||
return True
|
return True
|
||||||
except Exception as exc:
|
except Exception as exc:
|
||||||
log.warning("SSH final_check failed, falling back to DB check: %s", exc)
|
log.warning("SSH final_check raised, falling back to DB check: %s", exc)
|
||||||
|
|
||||||
# DB check (mock mode fallback)
|
# DB check (mock mode fallback)
|
||||||
async with _db() as db:
|
async with _db() as db:
|
||||||
|
|
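Reviewer note: the soft-pass rule in `_stage_final_check` hinges on the `_ssh_only` predicate. The sketch below isolates the classification so its edge cases are visible; `final_verdict` is a hypothetical helper, and the failure strings are made-up examples rather than real `get_smart_attributes` output.

```python
from typing import List


def ssh_only(failures: List[str]) -> bool:
    """True only when there is at least one failure and all are SSH errors."""
    return bool(failures) and all(f.startswith("SSH error:") for f in failures)


def final_verdict(health: str, failures: List[str]) -> str:
    """Classify a SMART result as 'soft-pass', 'fail', or 'pass'."""
    if ssh_only(failures):
        return "soft-pass"   # connectivity problem only; prior stages stand
    if health == "FAILED" or failures:
        return "fail"
    return "pass"
```

A mixed list (one SSH error plus one real attribute failure) still fails, because `all(...)` is false as soon as any entry is not an SSH error.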
@ -942,6 +1477,11 @@ async def check_stuck_jobs() -> None:
|
||||||
"UPDATE burnin_jobs SET state='unknown', finished_at=? WHERE id=?",
|
"UPDATE burnin_jobs SET state='unknown', finished_at=? WHERE id=?",
|
||||||
(now, job_id),
|
(now, job_id),
|
||||||
)
|
)
|
||||||
|
await db.execute(
|
||||||
|
"""UPDATE burnin_stages SET state='unknown', finished_at=?
|
||||||
|
WHERE burnin_job_id=? AND state='running'""",
|
||||||
|
(now, job_id),
|
||||||
|
)
|
||||||
await db.execute(
|
await db.execute(
|
||||||
"""INSERT INTO audit_events (event_type, drive_id, burnin_job_id, operator, message)
|
"""INSERT INTO audit_events (event_type, drive_id, burnin_job_id, operator, message)
|
||||||
VALUES (?,?,?,?,?)""",
|
VALUES (?,?,?,?,?)""",
|
||||||
|
|
@ -951,5 +1491,16 @@ async def check_stuck_jobs() -> None:
|
||||||
|
|
||||||
await db.commit()
|
await db.commit()
|
||||||
|
|
    # Actually unstick the running tasks so they release their semaphore slot.
    # Without this the DB state becomes 'unknown' but the asyncio task keeps
    # holding the slot forever — which is the bug that left subsequent jobs
    # permanently 'queued' until container restart.
    for row in stuck:
        job_id = row[0]
        await _kill_remote_process(job_id)
        task = _active_tasks.get(job_id)
        if task and not task.done():
            task.cancel()
||||||
|
|
||||||
_push_update()
|
_push_update()
|
||||||
log.warning("Marked %d stuck job(s) as unknown", len(stuck))
|
log.warning("Marked %d stuck job(s) as unknown", len(stuck))
|
||||||
|
|
|
||||||
|
|
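Reviewer note: the `_active_tasks` lookup above only works if every job task is registered when it is created. A minimal sketch of such a registry follows, assuming a plain dict keyed by job id; `active_tasks`, `start_job`, and `_job` are hypothetical names, and the done-callback cleanup is one possible design rather than what the PR necessarily does.

```python
import asyncio

active_tasks = {}   # hypothetical registry: job id -> asyncio.Task


async def _job(job_id: int) -> None:
    await asyncio.Event().wait()   # stands in for a job stuck on remote I/O


def start_job(job_id: int) -> asyncio.Task:
    task = asyncio.create_task(_job(job_id))
    # Strong reference: the event loop only keeps weak references to tasks,
    # and the watchdog needs a handle it can cancel.
    active_tasks[job_id] = task
    task.add_done_callback(lambda _t: active_tasks.pop(job_id, None))
    return task


async def _demo():
    task = start_job(7)
    await asyncio.sleep(0)         # let the job start and block
    active_tasks[7].cancel()       # what the stuck-job watchdog would do
    await asyncio.gather(task, return_exceptions=True)
    await asyncio.sleep(0)         # give the done-callback a turn
    return task.cancelled(), 7 in active_tasks


was_cancelled, still_tracked = asyncio.run(_demo())
```

Cancelling the task raises `CancelledError` inside the job, so any `async with semaphore:` block it holds unwinds and frees the slot.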
@ -58,6 +58,21 @@ class Settings(BaseSettings):
|
||||||
# Bad-block tolerance — surface_validate fails if bad blocks exceed this
|
# Bad-block tolerance — surface_validate fails if bad blocks exceed this
|
||||||
bad_block_threshold: int = 0
|
bad_block_threshold: int = 0
|
||||||
|
|
    # Surface-validate (badblocks) tunables — defaults match the Spearfoot
    # disk-burnin.sh community script's recommended geometry for large HDDs.
    #   block_size   : -b in bytes; aligned to AF (4 KiB) sectors. Bumping
    #                  to 8192 roughly halves badblocks runtime on multi-TB
    #                  drives at the cost of ~2x RAM in the test buffer.
    #   block_buffer : -c blocks held in memory per IO. 64 = badblocks
    #                  default. Higher values = larger buffer, faster IO,
    #                  more RAM (block_size * block_buffer bytes per pass).
    #   passes       : -p value. 1 = repeat until one consecutive clean
    #                  scan (current behavior). 2-3 for paranoid burn-in
    #                  that re-confirms after finding errors.
    surface_validate_block_size: int = 4096
    surface_validate_block_buffer: int = 64
    surface_validate_passes: int = 1
||||||
|
|
||||||
# SSH credentials for direct TrueNAS command execution (Stage 7)
|
# SSH credentials for direct TrueNAS command execution (Stage 7)
|
||||||
# When ssh_host is set, burn-in stages use SSH for smartctl/badblocks instead of REST API.
|
# When ssh_host is set, burn-in stages use SSH for smartctl/badblocks instead of REST API.
|
||||||
# Leave ssh_host empty to use the mock/REST API (development mode).
|
# Leave ssh_host empty to use the mock/REST API (development mode).
|
||||||
|
|
@ -68,7 +83,7 @@ class Settings(BaseSettings):
|
||||||
ssh_key: str = "" # PEM private key content (paste full key including headers)
|
ssh_key: str = "" # PEM private key content (paste full key including headers)
|
||||||
|
|
||||||
# Application version — used by the /api/v1/updates/check endpoint
|
# Application version — used by the /api/v1/updates/check endpoint
|
||||||
app_version: str = "1.0.0-7"
|
app_version: str = "1.0.0-21"
|
||||||
|
|
||||||
|
|
||||||
settings = Settings()
|
settings = Settings()
|
||||||
|
|
|
||||||
|
|
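Reviewer note: the three tunables map directly onto badblocks flags, and the comment's memory estimate is simple arithmetic. The sketch below renders the flag string and computes `block_size * block_buffer`; `badblocks_args` and `buffer_bytes` are hypothetical helpers, and the range validation is an assumption (the PR's `Settings` fields are plain `int`s with no validation shown).

```python
def badblocks_args(block_size: int = 4096, block_buffer: int = 64, passes: int = 1) -> str:
    """Render the badblocks flag string, rejecting values that make no sense."""
    if block_size <= 0 or block_buffer <= 0 or passes < 0:
        raise ValueError("block_size and block_buffer must be positive; passes >= 0")
    return f"-wsv -b {block_size} -c {block_buffer} -p {passes}"


def buffer_bytes(block_size: int, block_buffer: int) -> int:
    """Bytes covered by one batch of blocks, per the formula in the settings comment."""
    return block_size * block_buffer
```

With the defaults this gives `-wsv -b 4096 -c 64 -p 1` and a 256 KiB batch; doubling `block_size` to 8192 doubles the batch to 512 KiB.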
@ -89,6 +89,16 @@ _MIGRATIONS = [
|
||||||
"ALTER TABLE smart_tests ADD COLUMN raw_output TEXT",
|
"ALTER TABLE smart_tests ADD COLUMN raw_output TEXT",
|
||||||
# Stage 8: track last reset time so dashboard burn-in col clears after reset
|
# Stage 8: track last reset time so dashboard burn-in col clears after reset
|
||||||
"ALTER TABLE drives ADD COLUMN last_reset_at TEXT",
|
"ALTER TABLE drives ADD COLUMN last_reset_at TEXT",
|
    # 1.0.0-15: pool-membership lock
    "ALTER TABLE drives ADD COLUMN pool_name TEXT",
    "ALTER TABLE drives ADD COLUMN pool_role TEXT",
    "ALTER TABLE drives ADD COLUMN pool_seen_at TEXT",
    # 1.0.0-19: enforce one active burn-in per drive at the storage layer.
    # Closes the read-then-insert race in burnin.start_job — without this,
    # two concurrent /api/v1/burnin/start requests for the same drive could
    # both observe zero active jobs and both insert queued rows.
    """CREATE UNIQUE INDEX IF NOT EXISTS uniq_active_burnin_per_drive
       ON burnin_jobs (drive_id) WHERE state IN ('queued', 'running')""",
||||||
]
|
]
|
||||||
|
|
||||||
|
|
||||||
|
|
|
||||||
|
|
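Reviewer note: the partial unique index from the 1.0.0-19 migration can be checked against an in-memory database. The sketch below uses the index definition from the hunk above; the three-column `burnin_jobs` table is a cut-down assumption, not the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE burnin_jobs (id INTEGER PRIMARY KEY, drive_id INTEGER, state TEXT)")
conn.execute(
    """CREATE UNIQUE INDEX IF NOT EXISTS uniq_active_burnin_per_drive
       ON burnin_jobs (drive_id) WHERE state IN ('queued', 'running')"""
)

conn.execute("INSERT INTO burnin_jobs (drive_id, state) VALUES (1, 'passed')")
conn.execute("INSERT INTO burnin_jobs (drive_id, state) VALUES (1, 'queued')")

# A second active row for the same drive violates the index.
try:
    conn.execute("INSERT INTO burnin_jobs (drive_id, state) VALUES (1, 'running')")
    second_active_rejected = False
except sqlite3.IntegrityError:
    second_active_rejected = True

# Finished jobs fall outside the index predicate, so history can accumulate.
conn.execute("INSERT INTO burnin_jobs (drive_id, state) VALUES (1, 'failed')")
row_count = conn.execute("SELECT COUNT(*) FROM burnin_jobs").fetchone()[0]
```

Because the database rejects the losing insert, the application can treat `IntegrityError` as "a job is already active" instead of relying on a read-then-insert check.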
@ -5,6 +5,7 @@ Disabled when SMTP_HOST is not set.
|
||||||
"""
|
"""
|
||||||
|
|
||||||
import asyncio
|
import asyncio
|
||||||
|
import html
|
||||||
import logging
|
import logging
|
||||||
import smtplib
|
import smtplib
|
||||||
import ssl
|
import ssl
|
||||||
|
|
@ -109,17 +110,61 @@ def _drive_rows_html(drives: list[dict]) -> str:
|
||||||
return "\n".join(rows)
|
return "\n".join(rows)
|
||||||
|
|
||||||
|
|
||||||
def _build_html(drives: list[dict], generated_at: str) -> str:
|
def _build_unlock_banner_html(events: list[dict]) -> str:
|
||||||
|
"""Banner listing every pool-drive unlock granted in the last 24h.
|
||||||
|
|
||||||
|
    Every interpolated DB field is run through html.escape: operator and
    reason are free text from the unlock modal and would otherwise be
    injected into the email body verbatim.
|
||||||
|
"""
|
||||||
|
if not events:
|
||||||
|
return ""
|
||||||
|
rows = []
|
||||||
|
for e in events:
|
||||||
|
evt = e.get("event_type") or ""
|
||||||
|
is_boot = evt == "boot_pool_drive_unlocked"
|
||||||
|
is_exported = evt == "exported_pool_drive_unlocked"
|
||||||
|
kind = (
|
||||||
|
"BOOT POOL" if is_boot
|
||||||
|
else "EXPORTED ZFS" if is_exported
|
||||||
|
else "pool"
|
||||||
|
)
|
||||||
|
when = html.escape((e.get("created_at") or "")[:19])
|
||||||
|
operator = html.escape(e.get("operator") or "?")
|
||||||
|
devname = html.escape(e.get("devname") or "?")
|
||||||
|
# `message` already includes pool name, devname, and the operator's
|
||||||
|
# reason — surface it verbatim so the audit trail is faithful.
|
||||||
|
message = html.escape(e.get("message") or "")
|
||||||
|
rows.append(
|
||||||
|
f"<li style='margin:4px 0'><strong>{when}</strong> · "
|
||||||
|
f"<strong>{operator}</strong> unlocked a {kind} drive "
|
||||||
|
f"({devname}): "
|
||||||
|
f"<span style='color:#c9d1d9'>{message}</span></li>"
|
||||||
|
)
|
||||||
|
return f"""
|
||||||
|
<div style="background:#4b1113;border:1px solid #f85149;border-radius:6px;
|
||||||
|
padding:14px 18px;margin-bottom:20px;color:#f85149">
|
||||||
|
<div style="font-weight:600;font-size:14px;margin-bottom:6px">
|
||||||
|
⚠ {len(events)} pool-drive unlock(s) in the last 24h
|
||||||
|
</div>
|
||||||
|
<ul style="margin:0;padding-left:18px;font-size:12.5px;color:#f0a0a0">
|
||||||
|
{''.join(rows)}
|
||||||
|
</ul>
|
||||||
|
</div>"""
|
||||||
|
|
||||||
|
|
||||||
|
def _build_html(drives: list[dict], generated_at: str,
|
||||||
|
unlock_events: list[dict] | None = None) -> str:
|
||||||
total = len(drives)
|
total = len(drives)
|
||||||
failed_drives = [d for d in drives if d.get("smart_health") == "FAILED"]
|
failed_drives = [d for d in drives if d.get("smart_health") == "FAILED"]
|
||||||
running_burnin = [d for d in drives if (d.get("burnin") or {}).get("state") == "running"]
|
running_burnin = [d for d in drives if (d.get("burnin") or {}).get("state") == "running"]
|
||||||
passed_burnin = [d for d in drives if (d.get("burnin") or {}).get("state") == "passed"]
|
passed_burnin = [d for d in drives if (d.get("burnin") or {}).get("state") == "passed"]
|
||||||
|
|
||||||
# Alert banner
|
# Alert banners (unlock events first — the audit-grade signal)
|
||||||
alert_html = ""
|
alert_html = _build_unlock_banner_html(unlock_events or [])
|
||||||
if failed_drives:
|
if failed_drives:
|
||||||
names = ", ".join(d["devname"] for d in failed_drives)
|
names = ", ".join(d["devname"] for d in failed_drives)
|
||||||
alert_html = f"""
|
alert_html += f"""
|
||||||
<div style="background:#4b1113;border:1px solid #f85149;border-radius:6px;padding:14px 18px;margin-bottom:20px;color:#f85149;font-weight:500">
|
<div style="background:#4b1113;border:1px solid #f85149;border-radius:6px;padding:14px 18px;margin-bottom:20px;color:#f85149;font-weight:500">
|
||||||
⚠ SMART health FAILED on {len(failed_drives)} drive(s): {names}
|
⚠ SMART health FAILED on {len(failed_drives)} drive(s): {names}
|
||||||
</div>"""
|
</div>"""
|
||||||
|
|
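Reviewer note: the effect of `html.escape` on the free-text fields is easy to demonstrate. The sketch below renders a single list item; `unlock_item` is a hypothetical helper that is much simpler than `_build_unlock_banner_html`, and the input strings are invented.

```python
import html


def unlock_item(operator: str, reason: str) -> str:
    """Render one banner row with both free-text fields escaped."""
    return f"<li><strong>{html.escape(operator)}</strong>: {html.escape(reason)}</li>"


rendered = unlock_item("ops<script>", 'swap "sda" & retest')
```

`html.escape` also converts quote characters by default, so the same value stays safe if it is later moved into an attribute.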
@ -287,6 +332,36 @@ async def _fetch_report_data() -> list[dict]:
|
||||||
return await _fetch_drives_for_template(db)
|
return await _fetch_drives_for_template(db)
|
||||||
|
|
||||||
|
|
async def _fetch_unlock_events_24h() -> list[dict]:
    """Return pool-drive unlock audit events from the last 24 hours.

    These are operator overrides of the pool-membership lock — every entry
    represents a deliberate decision to risk a pool, so the daily report
    surfaces them as an audit-grade banner.
    """
    async with aiosqlite.connect(settings.db_path) as db:
        db.row_factory = aiosqlite.Row
        await db.execute("PRAGMA journal_mode=WAL")
        # julianday() handles the 'YYYY-MM-DDTHH:MM:SS.fff+00:00' format
        # we write from Python; comparing the raw string against
        # datetime('now','-1 day') (which formats as 'YYYY-MM-DD HH:MM:SS')
        # produces subtle off-by-up-to-a-day errors because of the
        # 'T' vs ' ' separator and the '+00:00' suffix.
        cur = await db.execute("""
            SELECT ae.event_type, ae.operator, ae.message, ae.created_at,
                   d.devname, d.pool_name, d.pool_role
            FROM audit_events ae
            LEFT JOIN drives d ON d.id = ae.drive_id
            WHERE ae.event_type IN (
                'pool_drive_unlocked',
                'boot_pool_drive_unlocked',
                'exported_pool_drive_unlocked')
              AND julianday(ae.created_at) >= julianday('now', '-1 day')
            ORDER BY ae.created_at DESC
        """)
        return [dict(r) for r in await cur.fetchall()]
||||||
|
|
||||||
|
|
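Reviewer note: the separator problem described in the `julianday()` comment can be reproduced in a few lines. The sketch below uses a fixed reference time instead of `'now'` so the result is repeatable; the single-column table and the timestamps are assumptions chosen to hit the failure case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE audit_events (created_at TEXT)")
# One hour *before* the cutoff used below, in the 'T'-separated form Python writes.
conn.execute("INSERT INTO audit_events VALUES ('2024-01-01T11:00:00.000+00:00')")

REFERENCE = "2024-01-02 12:00:00"   # fixed stand-in for 'now'

# Raw string comparison: 'T' sorts after ' ', so the stale row is wrongly included.
by_string = conn.execute(
    "SELECT COUNT(*) FROM audit_events WHERE created_at >= datetime(?, '-1 day')",
    (REFERENCE,),
).fetchone()[0]

# julianday() parses both sides into numbers, so the stale row is excluded.
by_julianday = conn.execute(
    "SELECT COUNT(*) FROM audit_events "
    "WHERE julianday(created_at) >= julianday(?, '-1 day')",
    (REFERENCE,),
).fetchone()[0]
```

The string comparison decides the outcome at the eleventh character, before the time of day is ever looked at, which is why any event on the cutoff date passes it.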
||||||
# ---------------------------------------------------------------------------
|
# ---------------------------------------------------------------------------
|
||||||
# Scheduler
|
# Scheduler
|
||||||
# ---------------------------------------------------------------------------
|
# ---------------------------------------------------------------------------
|
||||||
|
|
@ -411,9 +486,16 @@ async def test_smtp_connection() -> dict:
|
||||||
async def send_report_now() -> None:
|
async def send_report_now() -> None:
|
||||||
"""Send a report immediately (used by on-demand API endpoint)."""
|
"""Send a report immediately (used by on-demand API endpoint)."""
|
||||||
drives = await _fetch_report_data()
|
drives = await _fetch_report_data()
|
||||||
|
unlock_events = await _fetch_unlock_events_24h()
|
||||||
now_str = datetime.now().strftime("%Y-%m-%d %H:%M")
|
now_str = datetime.now().strftime("%Y-%m-%d %H:%M")
|
||||||
html = _build_html(drives, now_str)
|
html = _build_html(drives, now_str, unlock_events)
|
||||||
subject = f"Burn-In Report — {datetime.now().strftime('%Y-%m-%d')} ({len(drives)} drives)"
|
suffix = ""
|
||||||
|
if unlock_events:
|
||||||
|
suffix = f" — {len(unlock_events)} pool unlock(s)"
|
||||||
|
subject = (
|
||||||
|
f"Burn-In Report — {datetime.now().strftime('%Y-%m-%d')} "
|
||||||
|
f"({len(drives)} drives){suffix}"
|
||||||
|
)
|
||||||
await asyncio.to_thread(_send_email, subject, html)
|
await asyncio.to_thread(_send_email, subject, html)
|
||||||
|
|
||||||
|
|
||||||
|
|
|
||||||
|
|
@ -97,8 +97,17 @@ class DriveResponse(BaseModel):
|
||||||
smart_long: SmartTestState
|
smart_long: SmartTestState
|
||||||
notes: str | None = None
|
notes: str | None = None
|
||||||
location: str | None = None
|
location: str | None = None
|
||||||
|
pool_name: str | None = None
|
||||||
|
pool_role: str | None = None
|
||||||
|
pool_unlocked_until: float | None = None # unix epoch; null = locked
|
||||||
|
|
||||||
|
|
||||||
class UpdateDriveRequest(BaseModel):
|
class UpdateDriveRequest(BaseModel):
|
||||||
notes: str | None = None
|
notes: str | None = None
|
||||||
location: str | None = None
|
location: str | None = None
|
||||||
|
|
||||||
|
|
||||||
|
class UnlockPoolDriveRequest(BaseModel):
|
||||||
|
confirm_token: str
|
||||||
|
operator: str
|
||||||
|
reason: str
|
||||||
|
|
|
||||||
248
app/poller.py
|
|
@ -20,13 +20,15 @@ from app.truenas import TrueNASClient
|
||||||
|
|
||||||
log = logging.getLogger(__name__)
|
log = logging.getLogger(__name__)
|
||||||
|
|
||||||
# Shared state read by the /health endpoint
|
# Shared state read by the /health endpoint and dashboard template
|
||||||
_state: dict[str, Any] = {
|
_state: dict[str, Any] = {
|
||||||
"last_poll_at": None,
|
"last_poll_at": None,
|
||||||
"last_error": None,
|
"last_error": None,
|
||||||
"healthy": False,
|
"healthy": False,
|
||||||
"drives_seen": 0,
|
"drives_seen": 0,
|
||||||
"consecutive_failures": 0,
|
"consecutive_failures": 0,
|
||||||
|
"system_temps": {}, # {"cpu_c": int|None, "pch_c": int|None}
|
||||||
|
"thermal_pressure": "ok", # "ok" | "warn" | "crit" — based on running burn-in drive temps
|
||||||
}
|
}
|
||||||
|
|
||||||
# SSE subscriber queues — notified after each successful poll
|
# SSE subscriber queues — notified after each successful poll
|
||||||
|
|
@ -87,18 +89,60 @@ def _map_history_state(status: str) -> str:
|
||||||
# DB helpers
|
# DB helpers
|
||||||
# ---------------------------------------------------------------------------
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
async def _upsert_drive(db: aiosqlite.Connection, disk: dict, now: str) -> int:
|
async def _upsert_drive(db: aiosqlite.Connection, disk: dict, now: str,
|
||||||
await db.execute(
|
pool_info: dict | None = None,
|
||||||
|
update_pool: bool = True) -> int:
|
||||||
|
"""Insert/update a drive row.
|
||||||
|
|
||||||
|
pool_info: {"pool": str, "role": str} if this drive is currently in a
|
||||||
|
zpool, else None. None values clear pool columns so a removed-from-pool
|
||||||
|
drive doesn't stay locked.
|
||||||
|
|
||||||
|
update_pool: when False, pool columns are preserved on conflict and
|
||||||
|
initialised to NULL on insert. Callers pass False on detection failure
|
||||||
|
so a transient SSH outage doesn't silently unlock every drive.
|
||||||
"""
|
"""
|
||||||
INSERT INTO drives
|
pool_name = pool_info["pool"] if pool_info else None
|
||||||
(truenas_disk_id, devname, serial, model, size_bytes,
|
pool_role = pool_info["role"] if pool_info else None
|
||||||
temperature_c, smart_health, last_seen_at, last_polled_at)
|
pool_seen_at = now if pool_info else None
|
||||||
VALUES (?,?,?,?,?,?,?,?,?)
|
|
||||||
ON CONFLICT(truenas_disk_id) DO UPDATE SET
|
if update_pool:
|
||||||
|
update_clause = """
|
||||||
|
devname = excluded.devname,
|
||||||
|
serial = excluded.serial,
|
||||||
|
model = excluded.model,
|
||||||
|
size_bytes = excluded.size_bytes,
|
||||||
|
temperature_c = excluded.temperature_c,
|
||||||
|
smart_health = excluded.smart_health,
|
||||||
|
last_seen_at = excluded.last_seen_at,
|
||||||
|
last_polled_at = excluded.last_polled_at,
|
||||||
|
pool_name = excluded.pool_name,
|
||||||
|
pool_role = excluded.pool_role,
|
||||||
|
pool_seen_at = excluded.pool_seen_at
|
||||||
|
"""
|
||||||
|
else:
|
||||||
|
# Preserve pool_name / pool_role / pool_seen_at — detection failed
|
||||||
|
# this cycle, so we have no fresh data and must not overwrite.
|
||||||
|
update_clause = """
|
||||||
|
devname = excluded.devname,
|
||||||
|
serial = excluded.serial,
|
||||||
|
model = excluded.model,
|
||||||
|
size_bytes = excluded.size_bytes,
|
||||||
temperature_c = excluded.temperature_c,
|
temperature_c = excluded.temperature_c,
|
||||||
smart_health = excluded.smart_health,
|
smart_health = excluded.smart_health,
|
||||||
last_seen_at = excluded.last_seen_at,
|
last_seen_at = excluded.last_seen_at,
|
||||||
last_polled_at = excluded.last_polled_at
|
last_polled_at = excluded.last_polled_at
|
||||||
|
"""
|
||||||
|
|
||||||
|
await db.execute(
|
||||||
|
f"""
|
||||||
|
INSERT INTO drives
|
||||||
|
(truenas_disk_id, devname, serial, model, size_bytes,
|
||||||
|
temperature_c, smart_health, last_seen_at, last_polled_at,
|
||||||
|
pool_name, pool_role, pool_seen_at)
|
||||||
|
VALUES (?,?,?,?,?,?,?,?,?,?,?,?)
|
||||||
|
ON CONFLICT(truenas_disk_id) DO UPDATE SET
|
||||||
|
{update_clause}
|
||||||
""",
|
""",
|
||||||
(
|
(
|
||||||
disk["identifier"],
|
disk["identifier"],
|
||||||
|
|
@ -110,6 +154,9 @@ async def _upsert_drive(db: aiosqlite.Connection, disk: dict, now: str) -> int:
|
||||||
disk.get("smart_health", "UNKNOWN"),
|
disk.get("smart_health", "UNKNOWN"),
|
||||||
now,
|
now,
|
||||||
now,
|
now,
|
||||||
|
pool_name,
|
||||||
|
pool_role,
|
||||||
|
pool_seen_at,
|
||||||
),
|
),
|
||||||
)
|
)
|
||||||
cur = await db.execute(
|
cur = await db.execute(
|
||||||
|
|
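Reviewer note: the `update_pool` switch in `_upsert_drive` comes down to whether the pool columns appear in the `DO UPDATE SET` list. The sketch below shows the same idea on a three-column table; the schema, the `upsert` helper, and `POOL_CLAUSE` are simplified assumptions, not the PR's code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE drives (disk_id TEXT PRIMARY KEY, devname TEXT, pool_name TEXT)")

POOL_CLAUSE = ", pool_name = excluded.pool_name"


def upsert(disk_id, devname, pool_name, update_pool=True):
    # The pool assignment is appended only when detection succeeded this cycle.
    clause = "devname = excluded.devname" + (POOL_CLAUSE if update_pool else "")
    conn.execute(
        f"""INSERT INTO drives (disk_id, devname, pool_name) VALUES (?,?,?)
            ON CONFLICT(disk_id) DO UPDATE SET {clause}""",
        (disk_id, devname, pool_name),
    )


upsert("d1", "sda", "tank")                    # detection OK: drive is in 'tank'
upsert("d1", "sdb", None, update_pool=False)   # detection failed: keep 'tank'
after_outage = conn.execute("SELECT devname, pool_name FROM drives").fetchone()
upsert("d1", "sdb", None)                      # detection OK: drive left the pool
after_removal = conn.execute("SELECT devname, pool_name FROM drives").fetchone()
```

Only fixed SQL fragments are spliced into the statement; all values still go through bound parameters.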
@ -208,6 +255,67 @@ async def _sync_history(
|
||||||
# Poll cycle
|
# Poll cycle
|
||||||
# ---------------------------------------------------------------------------
|
# ---------------------------------------------------------------------------
|
||||||
|
|
async def _poll_smart_via_ssh(db: aiosqlite.Connection, now: str) -> None:
    """
    Poll progress for SMART tests started via SSH (truenas_job_id IS NULL).
    Used on TrueNAS SCALE 25.10+ where the REST smart/test API no longer exists.
    """
    from app import ssh_client
    if not ssh_client.is_configured():
        return

    cur = await db.execute(
        """SELECT st.id, st.test_type, st.drive_id, d.devname, st.started_at
           FROM smart_tests st
           JOIN drives d ON d.id = st.drive_id
           WHERE st.state = 'running' AND st.truenas_job_id IS NULL"""
    )
    rows = await cur.fetchall()
    if not rows:
        return

    for row in rows:
        test_id, ttype, drive_id, devname, started_at = row[0], row[1], row[2], row[3], row[4]
        try:
            progress = await ssh_client.poll_smart_progress(devname)
        except Exception as exc:
            log.warning("SSH SMART poll failed for %s: %s", devname, exc)
            continue

        state = progress["state"]
        pct_remaining = progress.get("percent_remaining")  # None = not yet in output
        raw_output = progress.get("output", "")

        if state == "running":
            # pct_remaining=None means smartctl output doesn't have the % line yet
            # (test just started) — keep percent at 0 rather than jumping to 100
            if pct_remaining is None:
                pct = 0
            else:
                pct = max(0, 100 - pct_remaining)
            eta = _eta_from_progress(pct, started_at)
            await db.execute(
                "UPDATE smart_tests SET percent=?, eta_at=?, raw_output=? WHERE id=?",
                (pct, eta, raw_output, test_id),
            )
        elif state == "passed":
            await db.execute(
                "UPDATE smart_tests SET state='passed', percent=100, finished_at=?, raw_output=? WHERE id=?",
                (now, raw_output, test_id),
            )
            log.info("SSH SMART %s passed on %s", ttype, devname)
        elif state == "failed":
            await db.execute(
                "UPDATE smart_tests SET state='failed', percent=0, finished_at=?, "
                "error_text=?, raw_output=? WHERE id=?",
                (now, f"SMART {ttype.upper()} test failed", raw_output, test_id),
            )
            log.warning("SSH SMART %s FAILED on %s", ttype, devname)
        # state == "unknown" → keep polling, no update

    await db.commit()
||||||
|
|
||||||
|
|
||||||
async def poll_cycle(client: TrueNASClient) -> int:
|
async def poll_cycle(client: TrueNASClient) -> int:
|
||||||
"""Run one full poll. Returns number of drives seen."""
|
"""Run one full poll. Returns number of drives seen."""
|
||||||
now = _now()
|
now = _now()
|
||||||
|
|
@@ -215,6 +323,88 @@ async def poll_cycle(client: TrueNASClient) -> int:
     disks = await client.get_disks()
     running_jobs = await client.get_smart_jobs(state="RUNNING")
 
+    # Fetch temperatures via SCALE-specific endpoint.
+    # CORE doesn't have this endpoint — silently skip on any error.
+    try:
+        temps = await client.get_disk_temperatures()
+    except Exception:
+        temps = {}
+
+    # Inject temperature into each disk dict (SCALE 25.10 has no temp in /disk)
+    for disk in disks:
+        devname = disk.get("devname", "")
+        t = temps.get(devname)
+        if t is not None:
+            disk["temperature"] = int(round(t))
+
+    # SMART health — TrueNAS /api/v2.0/disk doesn't expose smart_health,
+    # so without this every drive defaults to UNKNOWN forever (only burn-in
+    # stages used to populate it). Run `smartctl -H` over a single SSH
+    # session for every drive every Nth cycle. Cache between cycles via
+    # _state so the dashboard always renders the most recent answer.
+    SMART_HEALTH_EVERY_N_CYCLES = 5  # ~1 minute at default 12s interval
+    _state.setdefault("smart_health_cache", {})
+    cycle_n = _state.get("cycle", 0) + 1
+    _state["cycle"] = cycle_n
+    try:
+        from app import ssh_client as _ssh
+        if _ssh.is_configured() and (cycle_n % SMART_HEALTH_EVERY_N_CYCLES == 1):
+            health_map = await _ssh.get_smart_health_map(
+                [d["devname"] for d in disks if d.get("devname")]
+            )
+            if health_map is not None:
+                _state["smart_health_cache"] = health_map
+    except Exception as exc:
+        log.warning("smart_health refresh failed: %s", exc)
+    health_cache = _state.get("smart_health_cache") or {}
+    for disk in disks:
+        devname = disk.get("devname", "")
+        h = health_cache.get(devname)
+        if h:
+            disk["smart_health"] = h
+
+    # Pool membership map — drives in any zpool are locked from burn-in.
+    # ssh_client returns None on failure (distinct from {} which means "no
+    # pools"). If EITHER detection call fails we fail-closed: leave
+    # pool_name / pool_role columns alone so previously-locked drives stay
+    # locked, and previously-unlocked drives stay unlocked, until detection
+    # recovers. Treating a transient SSH blip as "no pool members" would
+    # silently unlock every drive on the next poll.
+    detection_ok = True
+    pool_map: dict = {}
+    zfs_member_set: set = set()
+    try:
+        from app import ssh_client as _ssh
+        if _ssh.is_configured():
+            pm = await _ssh.get_pool_membership()
+            zs = await _ssh.get_zfs_member_drives()
+            if pm is None or zs is None:
+                detection_ok = False
+            else:
+                pool_map = pm
+                zfs_member_set = zs
+        # SSH unconfigured (mock/dev mode) — detection_ok stays True with
+        # empty maps, so dev mode never artificially locks drives.
+    except Exception:
+        detection_ok = False
+
+    if not detection_ok:
+        log.warning(
+            "Pool detection failed this cycle — preserving existing "
+            "pool_name/pool_role columns. Locked drives stay locked, "
+            "unlocked drives stay unlocked, until SSH recovers."
+        )
+
+    if detection_ok:
+        # Drives carrying ZFS labels but not in any active pool are
+        # "exported" — same hazard as an active pool member, so lock them
+        # too. We don't know the original pool name without
+        # `zpool import`-style scanning (slow + blocks); display
+        # "(exported)" and use a special token.
+        for devname in zfs_member_set:
+            if devname not in pool_map:
+                pool_map[devname] = {"pool": "(exported)", "role": "exported"}
+
     # Index running jobs by (devname, test_type)
     active: dict[tuple[str, str], dict] = {}
     for job in running_jobs:
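Review note: the fail-closed rule in the hunk above can be isolated as a pure function, which makes the "None means detection failed, empty means no pools" distinction easy to test. This is only a sketch; `resolve_pool_state` and its parameter names are hypothetical and do not appear in the commit.

```python
from typing import Optional, Tuple


def resolve_pool_state(
    pool_membership: Optional[dict],
    zfs_members: Optional[set],
) -> Tuple[bool, dict]:
    """Return (detection_ok, pool_map) following the fail-closed rule.

    Hypothetical helper for illustration; the commit inlines this logic
    in poll_cycle.
    """
    # None from either probe means "detection failed", not "no pools".
    # The caller should then leave stored pool columns untouched.
    if pool_membership is None or zfs_members is None:
        return False, {}

    pool_map = dict(pool_membership)
    # Drives with ZFS labels that are not in an active pool count as exported.
    for devname in zfs_members:
        pool_map.setdefault(devname, {"pool": "(exported)", "role": "exported"})
    return True, pool_map
```

When the first element is `False`, the caller would skip the pool columns in the upsert, which is what `update_pool=detection_ok` expresses in the next hunk.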
@@ -233,7 +423,11 @@ async def poll_cycle(client: TrueNASClient) -> int:
 
         for disk in disks:
             devname = disk["devname"]
-            drive_id = await _upsert_drive(db, disk, now)
+            drive_id = await _upsert_drive(
+                db, disk, now,
+                pool_map.get(devname) if detection_ok else None,
+                update_pool=detection_ok,
+            )
 
             for ttype in ("short", "long"):
                 if (devname, ttype) in active:
@@ -243,6 +437,9 @@ async def poll_cycle(client: TrueNASClient) -> int:
 
         await db.commit()
 
+        # SSH SMART polling — for tests started via smartctl (no TrueNAS REST job)
+        await _poll_smart_via_ssh(db, now)
+
     return len(disks)
 
 
@@ -263,6 +460,39 @@ async def run(client: TrueNASClient) -> None:
             _state["drives_seen"] = count
             _state["consecutive_failures"] = 0
             log.debug("Poll OK", extra={"drives": count})
+
+            # System sensor temps via SSH (non-fatal)
+            from app import ssh_client as _ssh
+            if _ssh.is_configured():
+                try:
+                    _state["system_temps"] = await _ssh.get_system_sensors()
+                except Exception:
+                    pass
+
+            # Thermal pressure: max temp of drives currently under burn-in
+            try:
+                async with aiosqlite.connect(settings.db_path) as _tdb:
+                    _tdb.row_factory = aiosqlite.Row
+                    await _tdb.execute("PRAGMA journal_mode=WAL")
+                    _cur = await _tdb.execute("""
+                        SELECT MAX(d.temperature_c)
+                        FROM drives d
+                        JOIN burnin_jobs bj ON bj.drive_id = d.id
+                        WHERE bj.state = 'running' AND d.temperature_c IS NOT NULL
+                    """)
+                    _row = await _cur.fetchone()
+                    _max_t = _row[0] if _row and _row[0] is not None else None
+                    if _max_t is None:
+                        _state["thermal_pressure"] = "ok"
+                    elif _max_t >= settings.temp_crit_c:
+                        _state["thermal_pressure"] = "crit"
+                    elif _max_t >= settings.temp_warn_c:
+                        _state["thermal_pressure"] = "warn"
+                    else:
+                        _state["thermal_pressure"] = "ok"
+            except Exception:
+                _state["thermal_pressure"] = "ok"
+
             _notify_subscribers()
 
             # Check for stuck jobs every 5 cycles (~1 min at default 12s interval)
182
app/routes.py
@@ -15,7 +15,8 @@ from app.database import get_db
 from app.models import (
     BurninJobResponse, BurninStageResponse,
     CancelBurninRequest, DriveResponse,
-    SmartTestState, StartBurninRequest, UpdateDriveRequest,
+    SmartTestState, StartBurninRequest, UnlockPoolDriveRequest,
+    UpdateDriveRequest,
 )
 from app.renderer import templates
 
@@ -48,6 +49,22 @@ def _is_stale(last_polled_at: str) -> bool:
     return True
 
 
+def _compute_eta_seconds(started_at: str | None, percent: int) -> int | None:
+    """Linear ETA extrapolation from started_at and percent complete."""
+    if not started_at or percent <= 0:
+        return None
+    try:
+        start = datetime.fromisoformat(started_at)
+        if start.tzinfo is None:
+            start = start.replace(tzinfo=timezone.utc)
+        elapsed = (datetime.now(timezone.utc) - start).total_seconds()
+        total_est = elapsed / (percent / 100)
+        remaining = max(0, int(total_est - elapsed))
+        return remaining
+    except Exception:
+        return None
+
+
 def _build_smart(row: aiosqlite.Row, prefix: str) -> SmartTestState:
     eta_at = row[f"{prefix}_eta_at"]
     return SmartTestState(
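Review note: a worked example of the linear extrapolation used by `_compute_eta_seconds`. The sketch below is not the commit's function; `eta_seconds` is a hypothetical variant that takes the current time as a parameter so the arithmetic is deterministic.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def eta_seconds(started_at: datetime, percent: int, now: datetime) -> Optional[int]:
    """Remaining seconds, assuming progress stays linear in time."""
    if percent <= 0:
        return None  # no progress yet, nothing to extrapolate from
    elapsed = (now - started_at).total_seconds()
    total_est = elapsed / (percent / 100)  # projected total duration
    return max(0, int(total_est - elapsed))


start = datetime(2024, 1, 1, tzinfo=timezone.utc)
# 25% done after 600 s: projected total 2400 s, so 1800 s remain.
remaining = eta_seconds(start, 25, start + timedelta(minutes=10))
```

The estimate is only as good as the linearity assumption; a long SMART test or a badblocks pass that slows toward the inner tracks will tend to run past it.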
@@ -76,6 +93,11 @@ def _row_to_drive(row: aiosqlite.Row) -> DriveResponse:
         smart_long=_build_smart(row, "long"),
         notes=row["notes"],
         location=row["location"],
+        pool_name=row["pool_name"],
+        pool_role=row["pool_role"],
+        pool_unlocked_until=burnin.unlock_expiry(
+            row["id"], row["pool_name"], row["pool_role"],
+        ),
     )
 
 
@@ -96,7 +118,7 @@ _DRIVES_QUERY = """
 SELECT
     d.id, d.devname, d.serial, d.model, d.size_bytes,
     d.temperature_c, d.smart_health, d.last_polled_at,
-    d.notes, d.location,
+    d.notes, d.location, d.pool_name, d.pool_role,
     s.state AS short_state,
     s.percent AS short_percent,
     s.started_at AS short_started_at,
@@ -112,6 +134,7 @@ _DRIVES_QUERY = """
 FROM drives d
 LEFT JOIN smart_tests s ON s.drive_id = d.id AND s.test_type = 'short'
 LEFT JOIN smart_tests l ON l.drive_id = d.id AND l.test_type = 'long'
+WHERE d.last_seen_at >= datetime('now', '-7 days')
 {where}
 ORDER BY d.devname
 """
@@ -138,11 +161,55 @@ async def _fetch_drives_for_template(db: aiosqlite.Connection) -> list[dict]:
     cur = await db.execute(_DRIVES_QUERY.format(where=""))
     rows = await cur.fetchall()
     burnin_by_drive = await _fetch_burnin_by_drive(db)
+
+    # For burn-ins that include SMART stages, fetch those stages so we can
+    # mirror their progress/result in the Short/Long SMART columns.
+    # This covers both running stages (showing live progress) and completed
+    # stages (showing passed/failed after the burn-in moves to the next stage).
+    bi_smart_stages: dict[int, dict[str, dict]] = {}  # job_id -> {stage_name: row}
+    bi_ids_with_smart = [
+        bi["id"] for bi in burnin_by_drive.values()
+        if bi["state"] in ("running", "queued")
+    ]
+    if bi_ids_with_smart:
+        placeholders = ",".join("?" * len(bi_ids_with_smart))
+        cur = await db.execute(f"""
+            SELECT bs.burnin_job_id, bs.stage_name, bs.state, bs.percent,
+                   bs.started_at, bs.finished_at, bs.error_text
+            FROM burnin_stages bs
+            WHERE bs.burnin_job_id IN ({placeholders})
+              AND bs.stage_name IN ('short_smart', 'long_smart')
+              AND bs.state IN ('running', 'passed', 'failed')
+        """, bi_ids_with_smart)
+        for r in await cur.fetchall():
+            bi_smart_stages.setdefault(r["burnin_job_id"], {})[r["stage_name"]] = dict(r)
+
     drives = []
     for row in rows:
         d = _row_to_drive(row).model_dump()
         d["status"] = _compute_status(d)
-        d["burnin"] = burnin_by_drive.get(d["id"])
+        bi = burnin_by_drive.get(d["id"])
+        d["burnin"] = bi
+
+        # Overlay burn-in SMART stage progress/results onto the SMART columns
+        if bi and bi["id"] in bi_smart_stages:
+            for stage_name, stage in bi_smart_stages[bi["id"]].items():
+                target = "smart_short" if stage_name == "short_smart" else "smart_long"
+                # Only overlay if the standalone SMART column is idle/empty
+                existing = d.get(target) or {}
+                if existing.get("state") not in (None, "idle"):
+                    continue
+                pct = stage["percent"] or 0
+                d[target] = {
+                    "state": stage["state"],
+                    "percent": pct if stage["state"] == "running" else (100 if stage["state"] == "passed" else 0),
+                    "eta_seconds": _compute_eta_seconds(stage["started_at"], pct) if stage["state"] == "running" else None,
+                    "eta_timestamp": None,
+                    "started_at": stage["started_at"],
+                    "finished_at": stage["finished_at"],
+                    "error_text": stage["error_text"],
+                }
+
         drives.append(d)
     return drives
 
@@ -170,7 +237,7 @@ def _stale_context(poller_state: dict) -> dict:
 async def dashboard(request: Request, db: aiosqlite.Connection = Depends(get_db)):
     drives = await _fetch_drives_for_template(db)
     ps = poller.get_state()
-    return templates.TemplateResponse("dashboard.html", {
+    return templates.TemplateResponse(request, "dashboard.html", {
         "request": request,
         "drives": drives,
         "poller": ps,
@@ -218,6 +285,18 @@ async def sse_drives(request: Request):
 
             yield {"event": "drives-update", "data": html}
 
+            # Push system sensor state so JS can update temp chips live
+            ps = poller.get_state()
+            yield {
+                "event": "system-sensors",
+                "data": json.dumps({
+                    "system_temps": ps.get("system_temps", {}),
+                    "thermal_pressure": ps.get("thermal_pressure", "ok"),
+                    "temp_warn_c": settings.temp_warn_c,
+                    "temp_crit_c": settings.temp_crit_c,
+                }),
+            }
+
             # Push browser notification event if this was a job completion
             if alert:
                 yield {"event": "job-alert", "data": json.dumps(alert)}
@@ -258,7 +337,7 @@ async def list_drives(db: aiosqlite.Connection = Depends(get_db)):
 @router.get("/api/v1/drives/{drive_id}/drawer")
 async def drive_drawer(drive_id: int, db: aiosqlite.Connection = Depends(get_db)):
     """Data for the log drawer — latest burn-in job + stages, SMART tests, audit events."""
-    cur = await db.execute(_DRIVES_QUERY.format(where="WHERE d.id = ?"), (drive_id,))
+    cur = await db.execute(_DRIVES_QUERY.format(where="AND d.id = ?"), (drive_id,))
     row = await cur.fetchone()
     if not row:
         raise HTTPException(status_code=404, detail="Drive not found")
@@ -339,7 +418,7 @@ async def drive_drawer(drive_id: int, db: aiosqlite.Connection = Depends(get_db)
 @router.get("/api/v1/drives/{drive_id}", response_model=DriveResponse)
 async def get_drive(drive_id: int, db: aiosqlite.Connection = Depends(get_db)):
     cur = await db.execute(
-        _DRIVES_QUERY.format(where="WHERE d.id = ?"), (drive_id,)
+        _DRIVES_QUERY.format(where="AND d.id = ?"), (drive_id,)
     )
     row = await cur.fetchone()
     if not row:
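Review note: the `WHERE` to `AND` change in the two hunks above follows from `_DRIVES_QUERY` now carrying its own `WHERE` clause for the 7-day stale filter, so per-drive callers can only append further conditions. A minimal sketch with the standard-library `sqlite3` module; the reduced `drives` table, the shortened query, and SQLite's own `datetime('now')` timestamp format are assumptions made for illustration.

```python
import sqlite3

# Shortened stand-in for _DRIVES_QUERY: fixed stale filter plus a {where} slot.
QUERY = """
SELECT d.id, d.devname
FROM drives d
WHERE d.last_seen_at >= datetime('now', '-7 days')
{where}
ORDER BY d.devname
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE drives (id INTEGER PRIMARY KEY, devname TEXT, last_seen_at TEXT)"
)
conn.execute("INSERT INTO drives VALUES (1, 'sda', datetime('now'))")
conn.execute("INSERT INTO drives VALUES (2, 'sdb', datetime('now', '-30 days'))")

# Dashboard listing: only the recently seen drive survives the filter.
all_rows = conn.execute(QUERY.format(where="")).fetchall()

# Single-drive lookups must extend the existing WHERE with AND.
fresh = conn.execute(QUERY.format(where="AND d.id = ?"), (1,)).fetchall()
stale = conn.execute(QUERY.format(where="AND d.id = ?"), (2,)).fetchall()
```

One consequence worth noting for reviewers: a drive unseen for more than seven days no longer resolves through the per-drive endpoints either, so they return 404 for it.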
@@ -353,9 +432,13 @@ async def smart_start(
     body: dict,
     db: aiosqlite.Connection = Depends(get_db),
 ):
-    """Start a standalone SHORT or LONG SMART test on a single drive."""
-    from app.truenas import TrueNASClient
-    from app import burnin as _burnin
+    """Start a standalone SHORT or LONG SMART test on a single drive.
+
+    Uses SSH (smartctl) when configured — required for TrueNAS SCALE 25.10+
+    where the REST smart/test endpoint no longer exists.
+    Falls back to TrueNAS REST API for older versions.
+    """
+    from app import burnin as _burnin, ssh_client
 
     test_type = (body.get("type") or "").upper()
     if test_type not in ("SHORT", "LONG"):
@@ -367,16 +450,41 @@ async def smart_start(
         raise HTTPException(status_code=404, detail="Drive not found")
     devname = row[0]
 
-    # Use the shared TrueNAS client held by the burnin module
+    now = datetime.now(timezone.utc).isoformat()
+    ttype_lower = test_type.lower()
+
+    if ssh_client.is_configured():
+        # SSH path — works on TrueNAS SCALE 25.10+ and CORE
+        try:
+            output = await ssh_client.start_smart_test(devname, test_type)
+        except Exception as exc:
+            raise HTTPException(status_code=502, detail=f"SSH error: {exc}")
+
+        # Mark as running in DB (truenas_job_id=NULL signals SSH-managed test)
+        # Store smartctl start output as proof the test was initiated
+        await db.execute(
+            """INSERT INTO smart_tests (drive_id, test_type, state, percent, started_at, raw_output)
+               VALUES (?,?,?,?,?,?)
+               ON CONFLICT(drive_id, test_type) DO UPDATE SET
+                   state='running', percent=0, truenas_job_id=NULL,
+                   started_at=excluded.started_at, finished_at=NULL, error_text=NULL,
+                   raw_output=excluded.raw_output""",
+            (drive_id, ttype_lower, "running", 0, now, output),
+        )
+        await db.commit()
+        from app import poller as _poller
+        _poller._notify_subscribers()
+        return {"devname": devname, "type": test_type, "message": output[:200]}
+
+    else:
+        # REST path — older TrueNAS CORE / SCALE versions
         client = _burnin._client
         if client is None:
             raise HTTPException(status_code=503, detail="TrueNAS client not ready")
-
         try:
             tn_job_id = await client.start_smart_test([devname], test_type)
         except Exception as exc:
             raise HTTPException(status_code=502, detail=f"TrueNAS error: {exc}")
-
         return {"job_id": tn_job_id, "devname": devname, "type": test_type}
 
 
@@ -403,7 +511,16 @@ async def smart_cancel(
     if client is None:
         raise HTTPException(status_code=503, detail="TrueNAS client not ready")
 
-    # Find the running TrueNAS job for this drive/test-type
+    from app import ssh_client
+
+    if ssh_client.is_configured():
+        # SSH path — abort via smartctl -X
+        try:
+            await ssh_client.abort_smart_test(devname)
+        except Exception as exc:
+            raise HTTPException(status_code=502, detail=f"SSH abort error: {exc}")
+    else:
+        # REST path — find TrueNAS job and abort it
         try:
             jobs = await client.get_smart_jobs()
             tn_job_id = None
@@ -479,13 +596,35 @@ async def burnin_start(req: StartBurninRequest):
                 drive_id, req.profile, req.operator, stage_order=req.stage_order
             )
             results.append({"drive_id": drive_id, "job_id": job_id})
+        except burnin.PoolMemberError as exc:
+            errors.append({
+                "drive_id": drive_id,
+                "error": str(exc),
+                "pool_name": exc.pool_name,
+                "pool_role": exc.pool_role,
+                "pool_locked": True,
+            })
         except ValueError as exc:
             errors.append({"drive_id": drive_id, "error": str(exc)})
     if errors and not results:
-        raise HTTPException(status_code=409, detail=errors[0]["error"])
+        # Surface the first error's structured fields so the UI can render
+        # an unlock affordance instead of a generic toast.
+        raise HTTPException(status_code=409, detail=errors[0])
     return {"queued": results, "errors": errors}
 
 
+@router.post("/api/v1/drives/{drive_id}/unlock")
+async def unlock_pool_drive(drive_id: int, req: UnlockPoolDriveRequest):
+    try:
+        expiry = await burnin.grant_pool_unlock(
+            drive_id, req.confirm_token, req.operator, req.reason,
+        )
+    except ValueError as exc:
+        raise HTTPException(status_code=400, detail=str(exc))
+    return {"unlocked": True, "expires_at": expiry,
+            "ttl_seconds": burnin.UNLOCK_TTL_SECONDS}
+
+
 @router.post("/api/v1/burnin/{job_id}/cancel")
 async def burnin_cancel(job_id: int, req: CancelBurninRequest):
     ok = await burnin.cancel_job(job_id, req.operator)
@@ -562,7 +701,7 @@ async def history_list(
     jobs = [dict(r) for r in rows]
 
     ps = poller.get_state()
-    return templates.TemplateResponse("history.html", {
+    return templates.TemplateResponse(request, "history.html", {
         "request": request,
         "jobs": jobs,
         "active_state": state,
@@ -612,7 +751,7 @@ async def history_detail(
     job["stages"] = [dict(r) for r in await cur.fetchall()]
 
     ps = poller.get_state()
-    return templates.TemplateResponse("job_detail.html", {
+    return templates.TemplateResponse(request, "job_detail.html", {
         "request": request,
         "job": job,
         "poller": ps,
@@ -791,7 +930,7 @@ async def audit_log(
     cur = await db.execute(_AUDIT_QUERY)
     rows = [dict(r) for r in await cur.fetchall()]
     ps = poller.get_state()
-    return templates.TemplateResponse("audit.html", {
+    return templates.TemplateResponse(request, "audit.html", {
         "request": request,
         "events": rows,
         "event_colors": _AUDIT_EVENT_COLORS,
@@ -887,7 +1026,7 @@ async def stats_page(
     drives_total = (await cur.fetchone())[0]
 
     ps = poller.get_state()
-    return templates.TemplateResponse("stats.html", {
+    return templates.TemplateResponse(request, "stats.html", {
         "request": request,
         "overall": overall,
         "by_model": by_model,
@@ -931,6 +1070,9 @@ async def settings_page(
         "temp_warn_c": settings.temp_warn_c,
         "temp_crit_c": settings.temp_crit_c,
         "bad_block_threshold": settings.bad_block_threshold,
+        "surface_validate_block_size": settings.surface_validate_block_size,
+        "surface_validate_block_buffer": settings.surface_validate_block_buffer,
+        "surface_validate_passes": settings.surface_validate_passes,
         # SSH credentials (take effect immediately — each SSH call reads live settings)
         "ssh_host": settings.ssh_host,
         "ssh_port": settings.ssh_port,
@@ -948,7 +1090,7 @@ async def settings_page(
 
     from app import ssh_client as _ssh
     ps = poller.get_state()
-    return templates.TemplateResponse("settings.html", {
+    return templates.TemplateResponse(request, "settings.html", {
         "request": request,
         "editable": editable,
         "smtp_enabled": bool(settings.smtp_host),
@@ -1069,7 +1211,7 @@ async def history_print(
     """, (job_id,))
     job["stages"] = [dict(r) for r in await cur.fetchall()]
 
-    return templates.TemplateResponse("job_print.html", {
+    return templates.TemplateResponse(request, "job_print.html", {
         "request": request,
         "job": job,
     })

@@ -38,6 +38,9 @@ _EDITABLE: dict[str, type] = {
     "temp_warn_c": int,
     "temp_crit_c": int,
     "bad_block_threshold": int,
+    "surface_validate_block_size": int,
+    "surface_validate_block_buffer": int,
+    "surface_validate_passes": int,
     # SSH credentials — take effect immediately (each connection reads live settings)
     "ssh_host": str,
     "ssh_port": int,
@@ -96,6 +99,26 @@ def _apply(data: dict) -> None:
         if key == "bad_block_threshold" and int(val) < 0:
             log.warning("settings_store: bad_block_threshold must be >= 0 — ignoring")
             continue
+        if key == "surface_validate_block_size":
+            # badblocks accepts any positive int but in practice the
+            # useful range is 512..1048576 and it should be a power of 2.
+            v = int(val)
+            if v < 512 or v > 1048576 or (v & (v - 1)) != 0:
+                log.warning(
+                    "settings_store: surface_validate_block_size must be "
+                    "a power of 2 between 512 and 1048576 — ignoring %r", val
+                )
+                continue
+        if key == "surface_validate_block_buffer" and not (1 <= int(val) <= 4096):
+            log.warning(
+                "settings_store: surface_validate_block_buffer must be 1..4096 — ignoring"
+            )
+            continue
+        if key == "surface_validate_passes" and not (0 <= int(val) <= 16):
+            log.warning(
+                "settings_store: surface_validate_passes must be 0..16 — ignoring"
+            )
+            continue
         if key == "ssh_port" and not (1 <= int(val) <= 65535):
             log.warning("settings_store: ssh_port out of range — ignoring")
             continue

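Review note: the block-size check in the hunk above relies on the bit trick `v & (v - 1) == 0`, which holds for a positive integer exactly when it has a single set bit. A standalone sketch; `is_valid_block_size` is a hypothetical name, the commit inlines the condition in `_apply`.

```python
def is_valid_block_size(value: int) -> bool:
    """True for powers of two in the inclusive range 512..1048576."""
    # A power of two has one set bit; subtracting 1 flips that bit and
    # sets every lower bit, so the bitwise AND of the two is zero.
    return 512 <= value <= 1048576 and (value & (value - 1)) == 0


accepted = [v for v in (256, 512, 1000, 4096, 1048576, 2097152) if is_valid_block_size(v)]
```

Checking the range first also guards the bit test against zero, which would otherwise pass `v & (v - 1) == 0`.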
@@ -38,15 +38,26 @@ SMART_ATTRS: dict[int, tuple[str, bool]] = {
 # ---------------------------------------------------------------------------
 
 def is_configured() -> bool:
-    """Returns True when SSH credentials are present and usable."""
+    """Returns True when SSH host + at least one auth method is available."""
+    import os
     from app.config import settings
-    return bool(settings.ssh_host and (settings.ssh_password or settings.ssh_key))
+    if not settings.ssh_host:
+        return False
+    has_creds = bool(
+        settings.ssh_key
+        or settings.ssh_password
+        or os.path.exists(os.environ.get("SSH_KEY_FILE", _MOUNTED_KEY_PATH))
+    )
+    return has_creds
 
 
 # ---------------------------------------------------------------------------
 # Low-level connection
 # ---------------------------------------------------------------------------
 
+_MOUNTED_KEY_PATH = "/run/secrets/ssh_key"
+
+
 async def _connect():
     """Open a single-use SSH connection. Caller must use `async with`."""
     import asyncssh
@@ -59,9 +70,17 @@ async def _connect():
|
||||||
"known_hosts": None, # trust all hosts (same spirit as TRUENAS_VERIFY_TLS=false)
|
"known_hosts": None, # trust all hosts (same spirit as TRUENAS_VERIFY_TLS=false)
|
||||||
}
|
}
|
||||||
if settings.ssh_key:
|
if settings.ssh_key:
|
||||||
|
# Key material provided via env var (base case)
|
||||||
kwargs["client_keys"] = [asyncssh.import_private_key(settings.ssh_key)]
|
kwargs["client_keys"] = [asyncssh.import_private_key(settings.ssh_key)]
|
||||||
if settings.ssh_password:
|
elif settings.ssh_password:
|
||||||
kwargs["password"] = settings.ssh_password
|
kwargs["password"] = settings.ssh_password
|
||||||
|
else:
|
||||||
|
# Fall back to mounted key file (preferred for production — no key in env vars)
|
||||||
|
import os
|
||||||
|
key_path = os.environ.get("SSH_KEY_FILE", _MOUNTED_KEY_PATH)
|
||||||
|
if os.path.exists(key_path):
|
||||||
|
kwargs["client_keys"] = [key_path]
|
||||||
|
# If nothing is configured, asyncssh will attempt agent/default key lookup
|
||||||
|
|
||||||
return asyncssh.connect(**kwargs)
|
return asyncssh.connect(**kwargs)
|
||||||
|
|
||||||
|
|
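The new `_connect()` body resolves credentials in a fixed order: inline key, then password, then a mounted key file, then whatever asyncssh finds on its own. A minimal sketch of that precedence as a pure function, which makes the order easy to unit-test. `pick_auth_method` and its return labels are hypothetical names, not part of the commit; the injectable `env` and `exists` parameters are also assumptions added for testability.

```python
import os
from typing import Callable, Mapping, Optional

DEFAULT_KEY_PATH = "/run/secrets/ssh_key"  # mirrors _MOUNTED_KEY_PATH in the diff


def pick_auth_method(
    ssh_key: Optional[str],
    ssh_password: Optional[str],
    env: Optional[Mapping[str, str]] = None,
    exists: Callable[[str], bool] = os.path.exists,
) -> str:
    """Return a label for the auth method the if/elif/else chain would pick."""
    env = os.environ if env is None else env
    if ssh_key:
        return "inline_key"        # key material from settings wins
    if ssh_password:
        return "password"          # only consulted when no inline key is set
    key_path = env.get("SSH_KEY_FILE", DEFAULT_KEY_PATH)
    if exists(key_path):
        return "key_file"          # mounted secret, overridable via SSH_KEY_FILE
    return "agent_or_default"      # nothing configured: asyncssh defaults apply
```

One behavioural change is visible in the hunk: the old code used two independent `if` statements, so a key and a password could both be passed to asyncssh, whereas the new `elif` means a configured key suppresses the password.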
@@ -228,6 +247,294 @@ async def run_badblocks(
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def _parse_zpool_list_output(stdout: str) -> dict:
|
||||||
|
"""Pure parser for `zpool list -vHP` stdout. Exposed for unit tests.
|
||||||
|
|
||||||
|
See get_pool_membership() for output semantics. This function never
|
||||||
|
raises — malformed lines are silently skipped.
|
||||||
|
"""
|
||||||
|
import re as _re
|
||||||
|
|
||||||
|
def _strip_partition(name: str) -> str:
|
||||||
|
m = _re.match(r"^(nvme\d+n\d+)", name)
|
||||||
|
if m:
|
||||||
|
return m.group(1)
|
||||||
|
m = _re.match(r"^(sd[a-z]+)", name)
|
||||||
|
if m:
|
||||||
|
return m.group(1)
|
||||||
|
return name
|
||||||
|
|
||||||
|
SECTION_MARKERS = {"cache", "log", "logs", "spare", "spares",
|
||||||
|
"special", "dedup"}
|
||||||
|
SECTION_NORMALIZE = {"logs": "log", "spares": "spare"}
|
||||||
|
|
||||||
|
out: dict = {}
|
||||||
|
current_pool: str | None = None
|
||||||
|
current_role: str = "data"
|
||||||
|
|
||||||
|
for raw in stdout.splitlines():
|
||||||
|
if not raw.strip():
|
||||||
|
continue
|
||||||
|
depth = 0
|
||||||
|
while depth < len(raw) and raw[depth] == "\t":
|
||||||
|
depth += 1
|
||||||
|
first = raw[depth:].split("\t", 1)[0].strip()
|
||||||
|
|
||||||
|
if depth == 0:
|
||||||
|
current_pool = first
|
||||||
|
current_role = "data"
|
||||||
|
continue
|
||||||
|
|
||||||
|
if depth == 1:
|
||||||
|
if first in SECTION_MARKERS:
|
||||||
|
current_role = SECTION_NORMALIZE.get(first, first)
|
||||||
|
continue
|
||||||
|
if first.startswith(("mirror", "raidz", "draid")):
|
||||||
|
continue
|
||||||
|
if first.startswith("/dev/") and current_pool:
|
||||||
|
dn = _strip_partition(first[len("/dev/"):])
|
||||||
|
out[dn] = {"pool": current_pool, "role": current_role}
|
||||||
|
continue
|
||||||
|
|
||||||
|
if first.startswith("/dev/") and current_pool:
|
||||||
|
dn = _strip_partition(first[len("/dev/"):])
|
||||||
|
out[dn] = {"pool": current_pool, "role": current_role}
|
||||||
|
|
||||||
|
return out
|
||||||
|
|
||||||
|
|
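The partition-stripping step inside `_parse_zpool_list_output` is what lets `/dev/sda1` in `zpool list -vHP` output match a base devname such as `sda`. A standalone copy of that nested helper, shown here only so its behaviour can be exercised in isolation (the top-level name `strip_partition` is an assumption; the commit keeps it nested):

```python
import re


def strip_partition(name: str) -> str:
    """Reduce a partition name to its base device name.

    Standalone copy of the nested _strip_partition helper, for illustration.
    """
    m = re.match(r"^(nvme\d+n\d+)", name)   # nvme0n1p2 -> nvme0n1
    if m:
        return m.group(1)
    m = re.match(r"^(sd[a-z]+)", name)      # sda1 -> sda, sdaa3 -> sdaa
    if m:
        return m.group(1)
    return name                             # anything else is left untouched


if __name__ == "__main__":
    for dev in ("sda1", "sdaa3", "nvme0n1p2", "disk/by-id/example"):
        print(dev, "->", strip_partition(dev))
```

Names that match neither pattern pass through unchanged, so a vdev listed under a path such as `/dev/disk/by-id/...` would be keyed by that path rather than by a kernel devname.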
||||||
|
def _parse_lsblk_zfs_output(stdout: str) -> set:
|
||||||
|
"""Pure parser for `lsblk -no NAME,FSTYPE -l` stdout. Returns base
|
||||||
|
devnames carrying ZFS labels (whole-disk OR via partition). Exposed
|
||||||
|
for unit tests."""
|
||||||
|
import re as _re
|
||||||
|
out: set = set()
|
||||||
|
for line in stdout.splitlines():
|
||||||
|
parts = line.split()
|
||||||
|
if len(parts) < 2:
|
||||||
|
continue
|
||||||
|
name, fstype = parts[0], parts[1]
|
||||||
|
if fstype != "zfs_member":
|
||||||
|
continue
|
||||||
|
if name.startswith("nvme"):
|
||||||
|
m = _re.match(r"^(nvme\d+n\d+)", name)
|
||||||
|
if m:
|
||||||
|
out.add(m.group(1))
|
||||||
|
else:
|
||||||
|
m = _re.match(r"^(sd[a-z]+)", name)
|
||||||
|
if m:
|
||||||
|
out.add(m.group(1))
|
||||||
|
return out
|
||||||
|
|
||||||
|
|
||||||
|
async def get_pool_membership() -> dict | None:
|
||||||
|
"""Return {devname: {"pool": str, "role": str}} for every drive in any zpool.
|
||||||
|
|
||||||
|
Parses `zpool list -vHP` output. Tab-indent depth tells us structure:
|
||||||
|
depth 0 pool name line
|
||||||
|
depth 1 vdev type line (mirror-N, raidz*-N, draid*) OR section
|
||||||
|
marker (cache/log/logs/spare/spares/special/dedup) OR a single-disk
|
||||||
|
vdev that is itself a /dev/... entry
|
||||||
|
depth 2 device line within a vdev — '/dev/sdX', '/dev/nvmeXnY', etc.
|
||||||
|
may have a partition suffix that we strip back to the
|
||||||
|
base devname so it matches what TrueNAS reports.
|
||||||
|
Roles: data | cache | log | spare | special | dedup
|
||||||
|
Returns:
|
||||||
|
- {} when the SSH call succeeded and there are genuinely no pools, or when SSH is not configured
|
||||||
|
- None on any failure (SSH down, parse error, non-zero exit, no
|
||||||
|
stdout). Callers MUST treat None differently from {}: an
|
||||||
|
empty dict is "definitely no pool members," None is "we
|
||||||
|
couldn't tell." Treating None as "no pool members" is a
|
||||||
|
fail-open security regression.
|
||||||
|
"""
|
||||||
|
import re as _re
|
||||||
|
if not is_configured():
|
||||||
|
return {}
|
||||||
|
cmd = "zpool list -vHP 2>/dev/null"
|
||||||
|
try:
|
||||||
|
async with await _connect() as conn:
|
||||||
|
r = await conn.run(cmd, check=False)
|
||||||
|
if r.returncode != 0:
|
||||||
|
return None
|
||||||
|
except Exception:
|
||||||
|
return None
|
||||||
|
if not r.stdout:
|
||||||
|
# rc==0 with empty output = host has no pools. (`zpool list -H`
|
||||||
|
# returns no rows when zero pools are imported.) That's a real
|
||||||
|
# answer, not a failure.
|
||||||
|
return {}
|
||||||
|
return _parse_zpool_list_output(r.stdout)
|
||||||
|
|
||||||
|
|
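The docstring for `get_pool_membership()` requires callers to distinguish `None` ("couldn't tell") from `{}` ("definitely no pool members"). A minimal sketch of a fail-closed caller under that contract. `is_drive_locked` is a hypothetical function; the commit's real call site is not shown in this section.

```python
from typing import Optional


def is_drive_locked(
    devname: str,
    membership: Optional[dict],
    zfs_members: Optional[set],
) -> bool:
    """Decide whether a drive must be treated as pool-locked.

    Hypothetical caller for illustration. `membership` is the result of
    get_pool_membership() and `zfs_members` the result of
    get_zfs_member_drives(); either may be None when the lookup failed.
    """
    # Fail closed: if either lookup failed we cannot prove the drive is
    # free, so it stays locked.
    if membership is None or zfs_members is None:
        return True
    # Both lookups succeeded: locked if in an active pool or ZFS-labelled.
    return devname in membership or devname in zfs_members
```

Testing with `is None` rather than truthiness matters here, because an empty dict and an empty set are both falsy yet mean the opposite of a failed lookup.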
||||||
|
async def get_smart_health_map(devnames: list[str]) -> dict | None:
|
||||||
|
"""Return {devname: 'PASSED'|'FAILED'|'UNKNOWN'} for every devname.
|
||||||
|
|
||||||
|
Runs `smartctl -H` for each disk in a single SSH session — much faster
|
||||||
|
than one connection per disk. Returns None on any SSH failure so the
|
||||||
|
poller can fall back to the previously-stored health value rather than
|
||||||
|
silently overwriting everything as 'UNKNOWN'.
|
||||||
|
|
||||||
|
`smartctl -H` is the cheap SMART self-assessment lookup (no full
|
||||||
|
attribute scan) — milliseconds per drive. The output format is stable:
|
||||||
|
SMART overall-health self-assessment test result: PASSED
|
||||||
|
SMART overall-health self-assessment test result: FAILED!
|
||||||
|
For drives that don't support the command at all, the output contains
|
||||||
|
neither PASSED nor FAILED, so we record UNKNOWN for that device only.
|
||||||
|
"""
|
||||||
|
if not is_configured() or not devnames:
|
||||||
|
return {} if devnames else None
|
||||||
|
# Build one shell command list that prefixes each result with "@@DEVNAME@@"
|
||||||
|
# so we can split the combined stdout deterministically.
|
||||||
|
parts = []
|
||||||
|
for d in devnames:
|
||||||
|
# Reject anything that doesn't look like a basic devname so we
|
||||||
|
# never inject shell metacharacters into the remote command.
|
||||||
|
if not d.replace("nvme", "").replace("n", "").replace("p", "").replace("sd", "").isalnum():
|
||||||
|
continue
|
||||||
|
parts.append(f"echo '@@{d}@@'; smartctl -H /dev/{d} 2>&1; echo '@@END@@'")
|
||||||
|
if not parts:
|
||||||
|
return {}
|
||||||
|
cmd = "; ".join(parts)
|
||||||
|
try:
|
||||||
|
async with await _connect() as conn:
|
||||||
|
r = await asyncio.wait_for(conn.run(cmd, check=False), timeout=30)
|
||||||
|
except Exception:
|
||||||
|
return None
|
||||||
|
if not r.stdout:
|
||||||
|
return None
|
||||||
|
return _parse_smart_health_batch(r.stdout)
|
||||||
|
|
||||||
|
|
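The devname filter in `get_smart_health_map()` chains `replace()` calls before `isalnum()`, which keeps shell metacharacters out but still accepts any alphanumeric string (for example `abc123`). A stricter allow-list is sketched below, assuming devnames only ever take the `sdX` or `nvmeXnY` forms seen elsewhere in this diff. `is_safe_devname` is a hypothetical name, not part of the commit.

```python
import re

# Whole-disk names only: sda, sdaa, nvme0n1. Partitions are deliberately
# excluded because smartctl is pointed at the base device.
_DEVNAME_RE = re.compile(r"(?:sd[a-z]+|nvme[0-9]+n[0-9]+)")


def is_safe_devname(name: str) -> bool:
    """Return True only for names that fully match the allow-list pattern."""
    return _DEVNAME_RE.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("sda", "nvme0n1", "sda1", "abc123", "sda; reboot"):
        print(repr(candidate), is_safe_devname(candidate))
```

`fullmatch` anchors the pattern at both ends, so a trailing newline or appended command cannot slip through the way it could with a prefix-only `match`.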
||||||
|
def _parse_smart_health_batch(stdout: str) -> dict:
|
||||||
|
"""Pure parser for the batched smartctl -H output. Exposed for tests."""
|
||||||
|
result: dict[str, str] = {}
|
||||||
|
current: str | None = None
|
||||||
|
buf: list[str] = []
|
||||||
|
|
||||||
|
def _flush():
|
||||||
|
if current is None:
|
||||||
|
return
|
||||||
|
text = "\n".join(buf)
|
||||||
|
if "PASSED" in text:
|
||||||
|
result[current] = "PASSED"
|
||||||
|
elif "FAILED" in text or "FAILURE" in text:
|
||||||
|
result[current] = "FAILED"
|
||||||
|
else:
|
||||||
|
result[current] = "UNKNOWN"
|
||||||
|
|
||||||
|
for raw in stdout.splitlines():
|
||||||
|
line = raw.strip()
|
||||||
|
if line.startswith("@@") and line.endswith("@@"):
|
||||||
|
inner = line[2:-2]
|
||||||
|
if inner == "END":
|
||||||
|
_flush()
|
||||||
|
current = None
|
||||||
|
buf = []
|
||||||
|
else:
|
||||||
|
_flush()
|
||||||
|
current = inner
|
||||||
|
buf = []
|
||||||
|
else:
|
||||||
|
buf.append(line)
|
||||||
|
_flush()
|
||||||
|
return result
|
||||||
|
|
||||||
|
|
||||||
|
async def get_zfs_member_drives() -> set | None:
|
||||||
|
"""Return devnames of every drive that carries a ZFS label, whole-disk or on a partition.
|
||||||
|
|
||||||
|
Combined with get_pool_membership(): a drive in this set but NOT in the
|
||||||
|
active-pool map carries ZFS data from a previously-imported pool that
|
||||||
|
was exported (or imported on a different system). We treat those as
|
||||||
|
locked too — wiping them would silently destroy a pool.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
- set() when lsblk succeeded and no drives carry ZFS labels, or when SSH is not configured
|
||||||
|
- None on any failure. Same fail-closed semantics as
|
||||||
|
get_pool_membership() — callers must NOT treat None as
|
||||||
|
"no exported drives," that's a security regression.
|
||||||
|
"""
|
||||||
|
if not is_configured():
|
||||||
|
return set()
|
||||||
|
cmd = "lsblk -no NAME,FSTYPE -l 2>/dev/null"
|
||||||
|
try:
|
||||||
|
async with await _connect() as conn:
|
||||||
|
r = await conn.run(cmd, check=False)
|
||||||
|
if r.returncode != 0:
|
||||||
|
return None
|
||||||
|
except Exception:
|
||||||
|
return None
|
||||||
|
if not r.stdout:
|
||||||
|
# lsblk with rc==0 and no output is impossible on a normal Linux
|
||||||
|
# host; treat as failure rather than "no drives at all."
|
||||||
|
return None
|
||||||
|
return _parse_lsblk_zfs_output(r.stdout)
|
||||||
|
|
||||||
|
|
||||||
|
async def get_system_sensors() -> dict:
|
||||||
|
"""
|
||||||
|
Run `sensors -j` on TrueNAS and extract system-level temperatures.
|
||||||
|
Returns {"cpu_c": int|None, "pch_c": int|None}.
|
||||||
|
cpu_c = CPU package temp (coretemp chip)
|
||||||
|
pch_c = PCH/chipset temp (pch_* chip) — proxy for storage I/O lane thermals
|
||||||
|
Returns {} if SSH is not configured, lm-sensors is unavailable, or the call fails.
|
||||||
|
"""
|
||||||
|
if not is_configured():
|
||||||
|
return {}
|
||||||
|
try:
|
||||||
|
async with await _connect() as conn:
|
||||||
|
result = await conn.run("sensors -j 2>/dev/null", check=False)
|
||||||
|
output = result.stdout.strip()
|
||||||
|
if not output:
|
||||||
|
return {}
|
||||||
|
return _parse_sensors_json(output)
|
||||||
|
except Exception as exc:
|
||||||
|
log.debug("get_system_sensors failed: %s", exc)
|
||||||
|
return {}
|
||||||
|
|
||||||
|
|
||||||
|
def _parse_sensors_json(output: str) -> dict:
|
||||||
|
import json as _json
|
||||||
|
try:
|
||||||
|
data = _json.loads(output)
|
||||||
|
except Exception:
|
||||||
|
return {}
|
||||||
|
|
||||||
|
cpu_c: int | None = None
|
||||||
|
pch_c: int | None = None
|
||||||
|
|
||||||
|
for chip_name, chip_data in data.items():
|
||||||
|
if not isinstance(chip_data, dict):
|
||||||
|
continue
|
||||||
|
|
||||||
|
# CPU package temp — coretemp chip, "Package id N" sensor
|
||||||
|
if chip_name.startswith("coretemp") and cpu_c is None:
|
||||||
|
for sensor_name, sensor_vals in chip_data.items():
|
||||||
|
if not isinstance(sensor_vals, dict):
|
||||||
|
continue
|
||||||
|
if "package" in sensor_name.lower():
|
||||||
|
for k, v in sensor_vals.items():
|
||||||
|
if k.endswith("_input") and isinstance(v, (int, float)):
|
||||||
|
cpu_c = int(round(v))
|
||||||
|
break
|
||||||
|
if cpu_c is not None:
|
||||||
|
break
|
||||||
|
|
||||||
|
# PCH / chipset temp — manages PCIe lanes including HBA / storage I/O
|
||||||
|
elif chip_name.startswith("pch_") and pch_c is None:
|
||||||
|
for sensor_name, sensor_vals in chip_data.items():
|
||||||
|
if not isinstance(sensor_vals, dict):
|
||||||
|
continue
|
||||||
|
for k, v in sensor_vals.items():
|
||||||
|
if k.endswith("_input") and isinstance(v, (int, float)):
|
||||||
|
pch_c = int(round(v))
|
||||||
|
break
|
||||||
|
if pch_c is not None:
|
||||||
|
break
|
||||||
|
|
||||||
|
return {"cpu_c": cpu_c, "pch_c": pch_c}
|
||||||
|
|
||||||
|
|
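`_parse_sensors_json()` walks a two-level mapping of chip to sensor to readings and takes the first `*_input` value it finds. The sketch below shows that traversal on a small sample. The sample JSON is illustrative, not captured from the deployment, and its chip and sensor names are assumptions; `first_input` and `extract_temps` are hypothetical helper names.

```python
import json
from typing import Optional

# Illustrative payload in the general shape of `sensors -j` output.
SAMPLE = """
{
  "coretemp-isa-0000": {
    "Adapter": "ISA adapter",
    "Package id 0": {"temp1_input": 47.0, "temp1_max": 100.0},
    "Core 0": {"temp2_input": 44.0}
  },
  "pch_cannonlake-virtual-0": {
    "Adapter": "Virtual device",
    "temp1": {"temp1_input": 52.4}
  }
}
"""


def first_input(sensor_vals: dict) -> Optional[int]:
    """Return the first numeric *_input reading, rounded to whole degrees."""
    for key, value in sensor_vals.items():
        if key.endswith("_input") and isinstance(value, (int, float)):
            return int(round(value))
    return None


def extract_temps(data: dict) -> dict:
    cpu_c: Optional[int] = None
    pch_c: Optional[int] = None
    for chip, sensors in data.items():
        if not isinstance(sensors, dict):
            continue
        for name, vals in sensors.items():
            if not isinstance(vals, dict):
                continue  # skips string entries such as "Adapter"
            if chip.startswith("coretemp") and cpu_c is None and "package" in name.lower():
                cpu_c = first_input(vals)
            elif chip.startswith("pch_") and pch_c is None:
                pch_c = first_input(vals)
    return {"cpu_c": cpu_c, "pch_c": pch_c}


if __name__ == "__main__":
    print(extract_temps(json.loads(SAMPLE)))
```

Only the package-level reading is used for the CPU, so per-core sensors such as `Core 0` are ignored even though they also expose an `*_input` key.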
||||||
# ---------------------------------------------------------------------------
|
# ---------------------------------------------------------------------------
|
||||||
# Parsers
|
# Parsers
|
||||||
# ---------------------------------------------------------------------------
|
# ---------------------------------------------------------------------------
|
||||||
|
|
@@ -275,7 +582,7 @@ def _parse_smartctl(output: str) -> dict:
|
||||||
|
|
||||||
def _parse_smart_progress(output: str) -> dict:
|
def _parse_smart_progress(output: str) -> dict:
|
||||||
state = "unknown"
|
state = "unknown"
|
||||||
percent_remaining = 0
|
percent_remaining = None # None = "in progress but no % line parsed yet"
|
||||||
|
|
||||||
lower = output.lower()
|
lower = output.lower()
|
||||||
|
|
||||||
|
|
|
||||||
|
|
@@ -281,7 +281,11 @@ tr:hover td {
|
||||||
.col-size { min-width: 70px; text-align: right; }
|
.col-size { min-width: 70px; text-align: right; }
|
||||||
.col-temp { min-width: 75px; text-align: right; }
|
.col-temp { min-width: 75px; text-align: right; }
|
||||||
.col-health { min-width: 85px; }
|
.col-health { min-width: 85px; }
|
||||||
.col-smart { min-width: 150px; }
|
.col-smart { min-width: 95px; }
|
||||||
|
/* Tighter horizontal padding on the SMART columns — they hold short
|
||||||
|
pills ("Passed"/"—") or a progress bar, so the default 14px gutter
|
||||||
|
wastes space on 13" laptops. */
|
||||||
|
th.col-smart, td.col-smart { padding-left: 6px; padding-right: 6px; }
|
||||||
.col-actions { min-width: 170px; }
|
.col-actions { min-width: 170px; }
|
||||||
|
|
||||||
/* -----------------------------------------------------------------------
|
/* -----------------------------------------------------------------------
|
||||||
|
|
@@ -1076,6 +1080,56 @@ a.stat-card:hover {
|
||||||
.stat-passed .stat-value { color: var(--green); }
|
.stat-passed .stat-value { color: var(--green); }
|
||||||
.stat-idle .stat-value { color: var(--text-muted); }
|
.stat-idle .stat-value { color: var(--text-muted); }
|
||||||
|
|
||||||
|
/* Vertical separator between drive-count cards and sensor chips */
|
||||||
|
.stats-bar-sep {
|
||||||
|
width: 1px;
|
||||||
|
height: 36px;
|
||||||
|
background: var(--border);
|
||||||
|
align-self: center;
|
||||||
|
flex-shrink: 0;
|
||||||
|
}
|
||||||
|
|
||||||
|
/* Compact sensor chip — CPU / PCH / Thermal */
|
||||||
|
.stat-sensor {
|
||||||
|
background: var(--bg-card);
|
||||||
|
border: 1px solid var(--border);
|
||||||
|
border-radius: 8px;
|
||||||
|
padding: 6px 12px;
|
||||||
|
text-align: center;
|
||||||
|
min-width: 52px;
|
||||||
|
display: flex;
|
||||||
|
flex-direction: column;
|
||||||
|
gap: 2px;
|
||||||
|
}
|
||||||
|
|
||||||
|
.stat-sensor-val {
|
||||||
|
font-size: 16px;
|
||||||
|
font-weight: 700;
|
||||||
|
font-variant-numeric: tabular-nums;
|
||||||
|
line-height: 1.1;
|
||||||
|
}
|
||||||
|
|
||||||
|
.stat-sensor-label {
|
||||||
|
font-size: 9px;
|
||||||
|
text-transform: uppercase;
|
||||||
|
letter-spacing: 0.08em;
|
||||||
|
color: var(--text-muted);
|
||||||
|
line-height: 1.2;
|
||||||
|
}
|
||||||
|
|
||||||
|
/* Thermal pressure states */
|
||||||
|
.stat-sensor-thermal-warn {
|
||||||
|
border-color: var(--yellow-bd);
|
||||||
|
background: var(--yellow-bg);
|
||||||
|
}
|
||||||
|
.stat-sensor-thermal-warn .stat-sensor-val { color: var(--yellow); }
|
||||||
|
|
||||||
|
.stat-sensor-thermal-crit {
|
||||||
|
border-color: var(--red-bd);
|
||||||
|
background: var(--red-bg);
|
||||||
|
}
|
||||||
|
.stat-sensor-thermal-crit .stat-sensor-val { color: var(--red); }
|
||||||
|
|
||||||
/* -----------------------------------------------------------------------
|
/* -----------------------------------------------------------------------
|
||||||
Batch action bar (inside filter-bar)
|
Batch action bar (inside filter-bar)
|
||||||
----------------------------------------------------------------------- */
|
----------------------------------------------------------------------- */
|
||||||
|
|
@@ -2372,6 +2426,85 @@ tr.drawer-row-active {
|
||||||
color: var(--yellow);
|
color: var(--yellow);
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/* -----------------------------------------------------------------------
|
||||||
|
Pool-membership lock indicators
|
||||||
|
----------------------------------------------------------------------- */
|
||||||
|
.pool-lock-icon {
|
||||||
|
display: inline-block;
|
||||||
|
margin-right: 4px;
|
||||||
|
font-size: 12px;
|
||||||
|
color: var(--yellow);
|
||||||
|
vertical-align: baseline;
|
||||||
|
}
|
||||||
|
.pool-lock-icon.pool-lock-boot {
|
||||||
|
color: var(--red, #e25555);
|
||||||
|
}
|
||||||
|
.pool-pill {
|
||||||
|
display: inline-block;
|
||||||
|
margin-top: 3px;
|
||||||
|
padding: 1px 7px;
|
||||||
|
font-size: 10.5px;
|
||||||
|
font-weight: 600;
|
||||||
|
letter-spacing: 0.3px;
|
||||||
|
text-transform: uppercase;
|
||||||
|
border-radius: 4px;
|
||||||
|
background: color-mix(in srgb, var(--yellow) 14%, transparent);
|
||||||
|
color: var(--yellow);
|
||||||
|
border: 1px solid color-mix(in srgb, var(--yellow) 35%, transparent);
|
||||||
|
}
|
||||||
|
.pool-pill.pool-pill-boot {
|
||||||
|
background: color-mix(in srgb, var(--red, #e25555) 16%, transparent);
|
||||||
|
color: var(--red, #e25555);
|
||||||
|
border-color: color-mix(in srgb, var(--red, #e25555) 45%, transparent);
|
||||||
|
}
|
||||||
|
.pool-pill.pool-pill-exported {
|
||||||
|
background: color-mix(in srgb, #e07a3f 16%, transparent);
|
||||||
|
color: #e07a3f;
|
||||||
|
border-color: color-mix(in srgb, #e07a3f 45%, transparent);
|
||||||
|
}
|
||||||
|
.pool-lock-icon.pool-lock-exported {
|
||||||
|
color: #e07a3f;
|
||||||
|
}
|
||||||
|
.btn-unlock {
|
||||||
|
background: transparent;
|
||||||
|
border: 1px solid color-mix(in srgb, var(--yellow) 50%, transparent);
|
||||||
|
color: var(--yellow);
|
||||||
|
border-radius: 5px;
|
||||||
|
padding: 3px 9px;
|
||||||
|
font-size: 12px;
|
||||||
|
cursor: pointer;
|
||||||
|
transition: background .15s, color .15s, border-color .15s;
|
||||||
|
}
|
||||||
|
.btn-unlock:hover {
|
||||||
|
background: color-mix(in srgb, var(--yellow) 14%, transparent);
|
||||||
|
}
|
||||||
|
.btn-unlock-boot {
|
||||||
|
border-color: color-mix(in srgb, var(--red, #e25555) 55%, transparent);
|
||||||
|
color: var(--red, #e25555);
|
||||||
|
}
|
||||||
|
.btn-unlock-boot:hover {
|
||||||
|
background: color-mix(in srgb, var(--red, #e25555) 14%, transparent);
|
||||||
|
}
|
||||||
|
.btn-unlock-exported {
|
||||||
|
border-color: color-mix(in srgb, #e07a3f 55%, transparent);
|
||||||
|
color: #e07a3f;
|
||||||
|
}
|
||||||
|
.btn-unlock-exported:hover {
|
||||||
|
background: color-mix(in srgb, #e07a3f 14%, transparent);
|
||||||
|
}
|
||||||
|
.unlock-countdown {
|
||||||
|
margin-left: 4px;
|
||||||
|
font-size: 11px;
|
||||||
|
color: var(--green, #39c179);
|
||||||
|
font-variant-numeric: tabular-nums;
|
||||||
|
}
|
||||||
|
.unlock-countdown-expired {
|
||||||
|
color: var(--yellow);
|
||||||
|
}
|
||||||
|
.modal.modal-danger {
|
||||||
|
border-top: 3px solid var(--red, #e25555);
|
||||||
|
}
|
||||||
|
|
||||||
/* -----------------------------------------------------------------------
|
/* -----------------------------------------------------------------------
|
||||||
Parallel burn-in inline warning
|
Parallel burn-in inline warning
|
||||||
----------------------------------------------------------------------- */
|
----------------------------------------------------------------------- */
|
||||||
|
|
@@ -2409,41 +2542,3 @@ tr.drawer-row-active {
|
||||||
font-variant-numeric: tabular-nums;
|
font-variant-numeric: tabular-nums;
|
||||||
}
|
}
|
||||||
|
|
||||||
/* -----------------------------------------------------------------------
|
|
||||||
Live Terminal drawer panel (xterm.js)
|
|
||||||
----------------------------------------------------------------------- */
|
|
||||||
.drawer-panel-terminal {
|
|
||||||
padding: 0 !important;
|
|
||||||
overflow: hidden !important;
|
|
||||||
position: relative;
|
|
||||||
background: #0d1117;
|
|
||||||
}
|
|
||||||
|
|
||||||
/* Let xterm fill the full panel height */
|
|
||||||
.drawer-panel-terminal .xterm {
|
|
||||||
height: 100%;
|
|
||||||
}
|
|
||||||
.drawer-panel-terminal .xterm-viewport {
|
|
||||||
overflow-y: auto !important;
|
|
||||||
}
|
|
||||||
|
|
||||||
/* Reconnect bar — floats over the terminal when disconnected */
|
|
||||||
.term-reconnect-bar {
|
|
||||||
position: absolute;
|
|
||||||
bottom: 12px;
|
|
||||||
right: 12px;
|
|
||||||
z-index: 20;
|
|
||||||
display: flex;
|
|
||||||
align-items: center;
|
|
||||||
gap: 8px;
|
|
||||||
background: rgba(13,17,23,0.85);
|
|
||||||
border: 1px solid var(--border);
|
|
||||||
border-radius: 6px;
|
|
||||||
padding: 6px 10px;
|
|
||||||
font-size: 12px;
|
|
||||||
color: var(--text-muted);
|
|
||||||
}
|
|
||||||
.term-reconnect-bar .btn-secondary {
|
|
||||||
padding: 3px 10px;
|
|
||||||
font-size: 11px;
|
|
||||||
}
|
|
||||||
|
|
|
||||||
|
|
@@ -68,6 +68,7 @@
|
||||||
applyFilter(activeFilter);
|
applyFilter(activeFilter);
|
||||||
restoreCheckboxes();
|
restoreCheckboxes();
|
||||||
initElapsedTimers();
|
initElapsedTimers();
|
||||||
|
initUnlockCountdowns();
|
||||||
initLocationEdits();
|
initLocationEdits();
|
||||||
if (_drawerDriveId) {
|
if (_drawerDriveId) {
|
||||||
_drawerHighlightRow(_drawerDriveId);
|
_drawerHighlightRow(_drawerDriveId);
|
||||||
|
|
@@ -135,14 +136,59 @@
|
||||||
if (nb) nb.style.display = 'none';
|
if (nb) nb.style.display = 'none';
|
||||||
}
|
}
|
||||||
|
|
||||||
// Handle job-alert SSE events for browser notifications
|
// Handle SSE events
|
||||||
document.addEventListener('htmx:sseMessage', function (e) {
|
document.addEventListener('htmx:sseMessage', function (e) {
|
||||||
if (!e.detail || e.detail.type !== 'job-alert') return;
|
if (!e.detail) return;
|
||||||
try {
|
if (e.detail.type === 'job-alert') {
|
||||||
handleJobAlert(JSON.parse(e.detail.data));
|
try { handleJobAlert(JSON.parse(e.detail.data)); } catch (_) {}
|
||||||
} catch (_) {}
|
} else if (e.detail.type === 'system-sensors') {
|
||||||
|
try { handleSystemSensors(JSON.parse(e.detail.data)); } catch (_) {}
|
||||||
|
}
|
||||||
});
|
});
|
||||||
|
|
||||||
|
function handleSystemSensors(data) {
|
||||||
|
var st = data.system_temps || {};
|
||||||
|
var tp = data.thermal_pressure || 'ok';
|
||||||
|
var warn = data.temp_warn_c || 46;
|
||||||
|
var crit = data.temp_crit_c || 55;
|
||||||
|
|
||||||
|
function tempClass(c) {
|
||||||
|
if (c == null) return '';
|
||||||
|
return c >= crit ? 'temp-hot' : c >= warn ? 'temp-warm' : 'temp-cool';
|
||||||
|
}
|
||||||
|
|
||||||
|
// CPU chip
|
||||||
|
var cpuChip = document.getElementById('sensor-cpu');
|
||||||
|
var cpuVal = document.getElementById('sensor-cpu-val');
|
||||||
|
if (cpuVal && st.cpu_c != null) {
|
||||||
|
if (cpuChip) cpuChip.hidden = false;
|
||||||
|
cpuVal.textContent = st.cpu_c + '°';
|
||||||
|
cpuVal.className = 'stat-sensor-val ' + tempClass(st.cpu_c);
|
||||||
|
}
|
||||||
|
|
||||||
|
// PCH chip
|
||||||
|
var pchChip = document.getElementById('sensor-pch');
|
||||||
|
var pchVal = document.getElementById('sensor-pch-val');
|
||||||
|
if (pchVal && st.pch_c != null) {
|
||||||
|
if (pchChip) pchChip.hidden = false;
|
||||||
|
pchVal.textContent = st.pch_c + '°';
|
||||||
|
pchVal.className = 'stat-sensor-val ' + tempClass(st.pch_c);
|
||||||
|
}
|
||||||
|
|
||||||
|
// Thermal pressure chip
|
||||||
|
var tChip = document.getElementById('sensor-thermal');
|
||||||
|
var tVal = document.getElementById('sensor-thermal-val');
|
||||||
|
if (tChip && tVal) {
|
||||||
|
if (tp === 'warn' || tp === 'crit') {
|
||||||
|
tChip.hidden = false;
|
||||||
|
tChip.className = 'stat-sensor stat-sensor-thermal stat-sensor-thermal-' + tp;
|
||||||
|
tVal.textContent = tp === 'warn' ? 'WARM' : 'HOT';
|
||||||
|
} else {
|
||||||
|
tChip.hidden = true;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
function handleJobAlert(data) {
|
function handleJobAlert(data) {
|
||||||
var isPass = data.state === 'passed';
|
var isPass = data.state === 'passed';
|
||||||
var icon = isPass ? '✓' : '✕';
|
var icon = isPass ? '✓' : '✕';
|
||||||
|
|
@@ -203,6 +249,41 @@
|
||||||
|
|
||||||
initElapsedTimers();
|
initElapsedTimers();
|
||||||
|
|
||||||
|
// Live countdown for pool-drive unlock TTL — runs once per second; ticker
|
||||||
|
// self-stops when no .unlock-countdown spans remain on the page.
|
||||||
|
var _unlockTickInterval = null;
|
||||||
|
function tickUnlockCountdowns() {
|
||||||
|
var spans = document.querySelectorAll('.unlock-countdown[data-expires]');
|
||||||
|
if (spans.length === 0) {
|
||||||
|
if (_unlockTickInterval) {
|
||||||
|
clearInterval(_unlockTickInterval);
|
||||||
|
_unlockTickInterval = null;
|
||||||
|
}
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
var nowSec = Date.now() / 1000;
|
||||||
|
spans.forEach(function (el) {
|
||||||
|
var exp = parseFloat(el.dataset.expires);
|
||||||
|
if (!exp || isNaN(exp)) return;
|
||||||
|
var rem = Math.max(0, exp - nowSec);
|
||||||
|
if (rem <= 0) {
|
||||||
|
el.textContent = 'expired';
|
||||||
|
el.className = 'unlock-countdown unlock-countdown-expired';
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
var m = Math.floor(rem / 60);
|
||||||
|
var s = Math.floor(rem % 60);
|
||||||
|
el.textContent = '\u{1F513} ' + m + ':' + (s < 10 ? '0' : '') + s;
|
||||||
|
});
|
||||||
|
}
|
||||||
|
function initUnlockCountdowns() {
|
||||||
|
if (_unlockTickInterval) return;
|
||||||
|
if (document.querySelectorAll('.unlock-countdown[data-expires]').length === 0) return;
|
||||||
|
_unlockTickInterval = setInterval(tickUnlockCountdowns, 1000);
|
||||||
|
tickUnlockCountdowns();
|
||||||
|
}
|
||||||
|
initUnlockCountdowns();
|
||||||
|
|
||||||
// -----------------------------------------------------------------------
|
// -----------------------------------------------------------------------
|
||||||
// Inline location / notes edit
|
// Inline location / notes edit
|
||||||
// -----------------------------------------------------------------------
|
// -----------------------------------------------------------------------
|
||||||
|
|
@@ -538,7 +619,16 @@
|
||||||
|
|
||||||
var data = await resp.json();
|
var data = await resp.json();
|
||||||
if (!resp.ok) {
|
if (!resp.ok) {
|
||||||
showToast(data.detail || 'Failed to start burn-in', 'error');
|
// detail may be the structured pool-locked object {drive_id,
|
||||||
|
// pool_name, pool_role, pool_locked: true, error: "..."}.
|
||||||
|
// The user already opened the start modal, so the unlock TTL must
|
||||||
|
// have just expired between modal-open and submit. Auto-flip to
|
||||||
|
// the unlock modal for that drive.
|
||||||
|
if (_handlePoolLockedError(data.detail)) {
|
||||||
|
closeModal();
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
showToast(_extractErrorMessage(data.detail) || 'Failed to start burn-in', 'error');
|
||||||
return;
|
return;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
@@ -549,6 +639,161 @@
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// Helpers shared between single-drive and batch start error paths.
|
||||||
|
// Backend returns either a string (legacy errors) or, for pool-locked
|
||||||
|
// drives, an object: {drive_id, error, pool_name, pool_role, pool_locked}.
|
||||||
|
function _extractErrorMessage(detail) {
|
||||||
|
if (!detail) return null;
|
||||||
|
if (typeof detail === 'string') return detail;
|
||||||
|
if (typeof detail === 'object' && detail.error) return detail.error;
|
||||||
|
return null;
|
||||||
|
}
|
||||||
|
// Returns true if it handled a pool-locked error by opening the unlock
|
||||||
|
// modal for the offending drive. Caller should bail out.
|
||||||
|
function _handlePoolLockedError(detail) {
|
||||||
|
if (!detail || typeof detail !== 'object' || !detail.pool_locked) return false;
|
||||||
|
var driveId = detail.drive_id;
|
||||||
|
if (driveId == null) return false;
|
||||||
|
var btn = document.querySelector('.btn-unlock[data-drive-id="' + driveId + '"]');
|
||||||
|
if (btn) {
|
||||||
|
// openUnlockModal closes any other open modals as a side effect of
|
||||||
|
// calling its own close handlers; we still need to close the
|
||||||
|
// start/batch modal explicitly in the caller, since openUnlockModal
|
||||||
|
// doesn't know which one is open.
|
||||||
|
openUnlockModal(btn);
|
||||||
|
return true;
|
||||||
|
}
|
||||||
|
// Unlock button not in the DOM (drive row may have refreshed).
|
||||||
|
// Surface a descriptive toast instead of [object Object].
|
||||||
|
showToast(
|
||||||
|
(detail.error || 'Drive is pool-locked') +
|
||||||
|
' Reload the page and click Unlock on the drive row.',
|
||||||
|
'error',
|
||||||
|
);
|
||||||
|
return true;
|
||||||
|
}
|
||||||
|
|
||||||
|
// -----------------------------------------------------------------------
|
||||||
|
// Pool-drive Unlock modal
|
||||||
|
// -----------------------------------------------------------------------
|
||||||
|
|
||||||
|
var unlockDriveId = null;
|
||||||
|
var unlockExpectedToken = null;
|
||||||
|
|
||||||
|
function openUnlockModal(btn) {
|
||||||
|
unlockDriveId = btn.dataset.driveId;
|
||||||
|
var poolName = btn.dataset.poolName || '';
|
||||||
|
var poolRole = btn.dataset.poolRole || 'data';
|
||||||
|
var isBoot = btn.dataset.isBootPool === '1';
|
||||||
|
var isExported = btn.dataset.isExported === '1';
|
||||||
|
if (isBoot) unlockExpectedToken = 'DESTROY BOOT POOL';
|
||||||
|
else if (isExported) unlockExpectedToken = 'DESTROY EXPORTED POOL';
|
||||||
|
else unlockExpectedToken = poolName;
|
||||||
|
|
||||||
|
document.getElementById('unlock-devname').textContent = btn.dataset.devname || '—';
|
||||||
|
document.getElementById('unlock-model').textContent = btn.dataset.model || '—';
|
||||||
|
document.getElementById('unlock-serial').textContent = btn.dataset.serial || '—';
|
||||||
|
document.getElementById('unlock-size').textContent = btn.dataset.size || '—';
|
||||||
|
|
||||||
|
var chip = document.getElementById('unlock-pool-chip');
|
||||||
|
if (isExported) {
|
||||||
|
chip.textContent = 'exported ZFS';
|
||||||
|
chip.className = 'chip chip-aborted';
|
||||||
|
} else {
|
||||||
|
chip.textContent = poolName + ' · ' + poolRole;
|
||||||
|
chip.className = 'chip ' + (isBoot ? 'chip-failed' : 'chip-aborted');
|
||||||
|
}
|
||||||
|
|
||||||
|
var titleEl = document.getElementById('unlock-modal-title');
|
||||||
|
var warnTitle = document.getElementById('unlock-warning-title');
|
||||||
|
var warnBody = document.getElementById('unlock-warning-body');
|
||||||
|
if (isBoot) {
|
||||||
|
titleEl.textContent = 'Unlock BOOT POOL drive';
|
||||||
|
warnTitle.textContent = 'This is a TrueNAS BOOT drive.';
|
||||||
|
warnBody.textContent =
|
||||||
|
'Running burn-in on this drive will destroy the operating system on it. ' +
|
||||||
|
'If this drive is half of a mirrored boot pool, the system will continue running on the other mirror, ' +
|
||||||
|
'but you must already have a replacement plan. Proceeding without one bricks the host.';
|
||||||
|
} else if (isExported) {
|
||||||
|
titleEl.textContent = 'Unlock drive with EXPORTED ZFS data';
|
||||||
|
warnTitle.textContent = 'This drive carries ZFS data from a previously-imported pool.';
|
||||||
|
warnBody.textContent =
|
||||||
|
"TrueNAS isn't using this pool right now, but the drive still holds the labels and data. " +
|
||||||
|
'Burning it in will silently destroy whatever pool that data belongs to — including ' +
|
||||||
|
'pools that another system may be relying on. Confirm you have already evacuated or ' +
|
||||||
|
'reassigned the pool before continuing.';
|
||||||
|
} else {
|
||||||
|
titleEl.textContent = 'Unlock pool drive';
|
||||||
|
warnTitle.textContent = "This drive belongs to zpool '" + poolName + "'.";
|
||||||
|
warnBody.textContent =
|
||||||
|
'Running a destructive burn-in stage will overwrite all data on this drive ' +
|
||||||
|
'and almost certainly destroy the pool. Only proceed if you have already ' +
|
||||||
|
'removed this drive from the pool, or if you are intentionally decommissioning the pool.';
|
||||||
|
}
|
||||||
|
document.getElementById('unlock-confirm-token').textContent = unlockExpectedToken;
|
||||||
|
document.getElementById('unlock-confirm-hint').textContent = 'Expected: ' + unlockExpectedToken;
|
||||||
|
|
||||||
|
document.getElementById('unlock-confirm-input').value = '';
|
||||||
|
document.getElementById('unlock-reason-input').value = '';
|
||||||
|
var savedOp = localStorage.getItem('burnin_operator') || '';
|
||||||
|
document.getElementById('unlock-operator-input').value = savedOp;
|
||||||
|
validateUnlockModal();
|
||||||
|
|
||||||
|
document.getElementById('unlock-modal').removeAttribute('hidden');
|
||||||
|
setTimeout(function () {
|
||||||
|
document.getElementById('unlock-operator-input').focus();
|
||||||
|
}, 50);
|
||||||
|
}
|
||||||
|
|
||||||
|
function closeUnlockModal() {
|
||||||
|
document.getElementById('unlock-modal').setAttribute('hidden', '');
|
||||||
|
unlockDriveId = null;
|
||||||
|
unlockExpectedToken = null;
|
||||||
|
}
|
||||||
|
|
||||||
|
function validateUnlockModal() {
|
||||||
|
var op = (document.getElementById('unlock-operator-input').value || '').trim();
|
||||||
|
var rsn = (document.getElementById('unlock-reason-input').value || '').trim();
|
||||||
|
var tok = (document.getElementById('unlock-confirm-input').value || '').trim();
|
||||||
|
var ok = op.length > 0 && rsn.length >= 5 && tok === unlockExpectedToken;
|
||||||
|
document.getElementById('unlock-modal-submit-btn').disabled = !ok;
|
||||||
|
}
|
||||||
|
|
||||||
|
async function submitUnlock() {
|
||||||
|
var op = (document.getElementById('unlock-operator-input').value || '').trim();
|
||||||
|
var rsn = (document.getElementById('unlock-reason-input').value || '').trim();
|
||||||
|
var tok = (document.getElementById('unlock-confirm-input').value || '').trim();
|
||||||
|
localStorage.setItem('burnin_operator', op);
|
||||||
|
|
||||||
|
var btn = document.getElementById('unlock-modal-submit-btn');
|
||||||
|
btn.disabled = true;
|
||||||
|
|
||||||
|
try {
|
||||||
|
var resp = await fetch('/api/v1/drives/' + unlockDriveId + '/unlock', {
|
||||||
|
method: 'POST',
|
||||||
|
headers: { 'Content-Type': 'application/json' },
|
||||||
|
body: JSON.stringify({
|
||||||
|
confirm_token: tok,
|
||||||
|
operator: op,
|
||||||
|
reason: rsn,
|
||||||
|
}),
|
||||||
|
});
|
||||||
|
var data = await resp.json();
|
||||||
|
if (!resp.ok) {
|
||||||
|
showToast(_extractErrorMessage(data.detail) || 'Unlock failed', 'error');
|
||||||
|
btn.disabled = false;
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
closeUnlockModal();
|
||||||
|
showToast('Unlocked for 10 minutes — start burn-in now to use it.', 'success');
|
||||||
|
// Force a drive list refresh so the row flips from Unlock → Burn-In
|
||||||
|
if (typeof refreshDrives === 'function') refreshDrives();
|
||||||
|
} catch (err) {
|
||||||
|
showToast('Network error', 'error');
|
||||||
|
btn.disabled = false;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
// -----------------------------------------------------------------------
|
// -----------------------------------------------------------------------
|
||||||
// Batch Burn-In
|
// Batch Burn-In
|
||||||
// -----------------------------------------------------------------------
|
// -----------------------------------------------------------------------
|
||||||
|
|
@ -729,7 +974,11 @@
|
||||||
});
|
});
|
||||||
var data = await resp.json();
|
var data = await resp.json();
|
||||||
if (!resp.ok) {
|
if (!resp.ok) {
|
||||||
showToast(data.detail || 'Failed to queue batch', 'error');
|
if (_handlePoolLockedError(data.detail)) {
|
||||||
|
closeBatchModal();
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
showToast(_extractErrorMessage(data.detail) || 'Failed to queue batch', 'error');
|
||||||
if (btn) btn.disabled = false;
|
if (btn) btn.disabled = false;
|
||||||
return;
|
return;
|
||||||
}
|
}
|
||||||
|
|
@ -738,10 +987,17 @@
|
||||||
checkedDriveIds.clear();
|
checkedDriveIds.clear();
|
||||||
updateBatchBar();
|
updateBatchBar();
|
||||||
var queued = (data.queued || []).length;
|
var queued = (data.queued || []).length;
|
||||||
var errors = (data.errors || []).length;
|
var allErrors = data.errors || [];
|
||||||
var msg = queued + ' burn-in(s) queued';
|
var poolLocked = allErrors.filter(function (e) { return e && e.pool_locked; });
|
||||||
if (errors) msg += ', ' + errors + ' skipped (already active)';
|
var alreadyActive = allErrors.length - poolLocked.length;
|
||||||
showToast(msg, errors && !queued ? 'error' : 'success');
|
|
||||||
|
var parts = [queued + ' burn-in(s) queued'];
|
||||||
|
if (alreadyActive) parts.push(alreadyActive + ' skipped (already active)');
|
||||||
|
if (poolLocked.length) {
|
||||||
|
parts.push(poolLocked.length + ' pool-locked (use Unlock on each row)');
|
||||||
|
}
|
||||||
|
var tone = (queued === 0 && allErrors.length) ? 'error' : 'success';
|
||||||
|
showToast(parts.join(', '), tone);
|
||||||
} catch (err) {
|
} catch (err) {
|
||||||
showToast('Network error', 'error');
|
showToast('Network error', 'error');
|
||||||
if (btn) btn.disabled = false;
|
if (btn) btn.disabled = false;
|
||||||
|
|
@ -792,6 +1048,10 @@
|
||||||
var cancelSmartBtn = e.target.closest('.btn-cancel-smart');
|
var cancelSmartBtn = e.target.closest('.btn-cancel-smart');
|
||||||
if (cancelSmartBtn && !cancelSmartBtn.disabled) { cancelSmartTest(cancelSmartBtn); return; }
|
if (cancelSmartBtn && !cancelSmartBtn.disabled) { cancelSmartTest(cancelSmartBtn); return; }
|
||||||
|
|
||||||
|
// Pool-drive unlock button (single drive)
|
||||||
|
var unlockBtn = e.target.closest('.btn-unlock');
|
||||||
|
if (unlockBtn && !unlockBtn.disabled) { openUnlockModal(unlockBtn); return; }
|
||||||
|
|
||||||
// Burn-in start button (single drive)
|
// Burn-in start button (single drive)
|
||||||
var startBtn = e.target.closest('.btn-start');
|
var startBtn = e.target.closest('.btn-start');
|
||||||
if (startBtn && !startBtn.disabled) { openModal(startBtn); return; }
|
if (startBtn && !startBtn.disabled) { openModal(startBtn); return; }
|
||||||
|
|
@ -820,6 +1080,14 @@
|
||||||
return;
|
return;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// Unlock modal
|
||||||
|
if (e.target.closest('#unlock-modal-close-btn') || e.target.closest('#unlock-modal-cancel-btn')) {
|
||||||
|
closeUnlockModal();
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
if (e.target.id === 'unlock-modal') { closeUnlockModal(); return; }
|
||||||
|
if (e.target.id === 'unlock-modal-submit-btn') { submitUnlock(); return; }
|
||||||
|
|
||||||
// Batch modal close
|
// Batch modal close
|
||||||
if (e.target.closest('#batch-modal-close-btn') || e.target.closest('#batch-modal-cancel-btn')) {
|
if (e.target.closest('#batch-modal-close-btn') || e.target.closest('#batch-modal-cancel-btn')) {
|
||||||
closeBatchModal();
|
closeBatchModal();
|
||||||
|
|
@ -837,11 +1105,15 @@
|
||||||
|
|
||||||
document.addEventListener('input', function (e) {
|
document.addEventListener('input', function (e) {
|
||||||
var id = e.target.id;
|
var id = e.target.id;
|
||||||
|
if (id === 'unlock-operator-input' || id === 'unlock-reason-input' ||
|
||||||
|
id === 'unlock-confirm-input') validateUnlockModal();
|
||||||
if (id === 'operator-input' || id === 'confirm-serial') validateModal();
|
if (id === 'operator-input' || id === 'confirm-serial') validateModal();
|
||||||
});
|
});
|
||||||
|
|
||||||
document.addEventListener('keydown', function (e) {
|
document.addEventListener('keydown', function (e) {
|
||||||
if (e.key === 'Escape') {
|
if (e.key === 'Escape') {
|
||||||
|
var uModal = document.getElementById('unlock-modal');
|
||||||
|
if (uModal && !uModal.hidden) { closeUnlockModal(); return; }
|
||||||
var modal = document.getElementById('start-modal');
|
var modal = document.getElementById('start-modal');
|
||||||
if (modal && !modal.hidden) { closeModal(); return; }
|
if (modal && !modal.hidden) { closeModal(); return; }
|
||||||
var bModal = document.getElementById('batch-modal');
|
var bModal = document.getElementById('batch-modal');
|
||||||
|
|
@ -1117,14 +1389,6 @@
|
||||||
document.querySelectorAll('.drawer-panel').forEach(function (p) {
|
document.querySelectorAll('.drawer-panel').forEach(function (p) {
|
||||||
p.classList.toggle('active', p.id === 'drawer-panel-' + _drawerTab);
|
p.classList.toggle('active', p.id === 'drawer-panel-' + _drawerTab);
|
||||||
});
|
});
|
||||||
// Terminal tab: init/fit on activation; hide autoscroll (N/A for terminal)
|
|
||||||
var asl = document.querySelector('.autoscroll-label');
|
|
||||||
if (_drawerTab === 'terminal') {
|
|
||||||
if (asl) asl.style.visibility = 'hidden';
|
|
||||||
openTerminalTab();
|
|
||||||
} else {
|
|
||||||
if (asl) asl.style.visibility = '';
|
|
||||||
}
|
|
||||||
});
|
});
|
||||||
|
|
||||||
// Close button
|
// Close button
|
||||||
|
|
@ -1149,155 +1413,4 @@
|
||||||
}).catch(function () { showToast('Network error', 'error'); });
|
}).catch(function () { showToast('Network error', 'error'); });
|
||||||
});
|
});
|
||||||
|
|
||||||
// -----------------------------------------------------------------------
|
|
||||||
// Live Terminal (xterm.js + SSH WebSocket)
|
|
||||||
// -----------------------------------------------------------------------
|
|
||||||
|
|
||||||
var _xtermReady = false; // xterm.js + FitAddon libraries loaded
|
|
||||||
var _terminal = null; // xterm.js Terminal instance
|
|
||||||
var _termFit = null; // FitAddon instance
|
|
||||||
var _termWs = null; // active WebSocket (null = disconnected)
|
|
||||||
|
|
||||||
function _loadXtermLibs(cb) {
|
|
||||||
var link = document.createElement('link');
|
|
||||||
link.rel = 'stylesheet';
|
|
||||||
link.href = 'https://cdn.jsdelivr.net/npm/xterm@5.3.0/css/xterm.css';
|
|
||||||
document.head.appendChild(link);
|
|
||||||
|
|
||||||
var s1 = document.createElement('script');
|
|
||||||
s1.src = 'https://cdn.jsdelivr.net/npm/xterm@5.3.0/lib/xterm.js';
|
|
||||||
s1.onload = function () {
|
|
||||||
var s2 = document.createElement('script');
|
|
||||||
s2.src = 'https://cdn.jsdelivr.net/npm/xterm-addon-fit@0.8.0/lib/xterm-addon-fit.js';
|
|
||||||
s2.onload = cb;
|
|
||||||
document.head.appendChild(s2);
|
|
||||||
};
|
|
||||||
document.head.appendChild(s1);
|
|
||||||
}
|
|
||||||
|
|
||||||
function openTerminalTab() {
|
|
||||||
var panel = document.getElementById('drawer-panel-terminal');
|
|
||||||
if (!panel) return;
|
|
||||||
|
|
||||||
if (!_xtermReady) {
|
|
||||||
panel.innerHTML = '<div class="drawer-loading">Loading terminal\u2026</div>';
|
|
||||||
_loadXtermLibs(function () {
|
|
||||||
_xtermReady = true;
|
|
||||||
_termInit(panel);
|
|
||||||
});
|
|
||||||
return;
|
|
||||||
}
|
|
||||||
|
|
||||||
if (!_terminal) {
|
|
||||||
_termInit(panel);
|
|
||||||
return;
|
|
||||||
}
|
|
||||||
|
|
||||||
// Already initialised — refit to current panel dimensions
|
|
||||||
setTimeout(function () {
|
|
||||||
if (_termFit) try { _termFit.fit(); } catch (_) {}
|
|
||||||
}, 30);
|
|
||||||
}
|
|
||||||
|
|
||||||
function _termInit(panel) {
|
|
||||||
panel.innerHTML = '';
|
|
||||||
|
|
||||||
var term = new Terminal({
|
|
||||||
cursorBlink: true,
|
|
||||||
fontSize: 13,
|
|
||||||
fontFamily: '"SF Mono","Fira Code",Consolas,"DejaVu Sans Mono",monospace',
|
|
||||||
theme: {
|
|
||||||
background: '#0d1117',
|
|
||||||
foreground: '#e6edf3',
|
|
||||||
cursor: '#58a6ff',
|
|
||||||
cursorAccent: '#0d1117',
|
|
||||||
selectionBackground: 'rgba(88,166,255,0.25)',
|
|
||||||
black: '#484f58', red: '#ff7b72', green: '#3fb950', yellow: '#d29922',
|
|
||||||
blue: '#58a6ff', magenta: '#bc8cff', cyan: '#39c5cf', white: '#b1bac4',
|
|
||||||
brightBlack: '#6e7681', brightRed: '#ffa198', brightGreen: '#56d364',
|
|
||||||
brightYellow: '#e3b341', brightBlue: '#79c0ff', brightMagenta: '#d2a8ff',
|
|
||||||
brightCyan: '#56d4dd', brightWhite: '#f0f6fc',
|
|
||||||
},
|
|
||||||
scrollback: 2000,
|
|
||||||
allowProposedApi: true,
|
|
||||||
});
|
|
||||||
|
|
||||||
var fit = new FitAddon.FitAddon();
|
|
||||||
term.loadAddon(fit);
|
|
||||||
term.open(panel);
|
|
||||||
|
|
||||||
_terminal = term;
|
|
||||||
_termFit = fit;
|
|
||||||
|
|
||||||
// Initial fit after the panel is visible
|
|
||||||
setTimeout(function () {
|
|
||||||
if (_termFit) try { _termFit.fit(); } catch (_) {}
|
|
||||||
}, 30);
|
|
||||||
|
|
||||||
// Forward all keystrokes → SSH (onData registered once here)
|
|
||||||
term.onData(function (data) {
|
|
||||||
if (_termWs && _termWs.readyState === 1) {
|
|
||||||
_termWs.send(new TextEncoder().encode(data));
|
|
||||||
}
|
|
||||||
});
|
|
||||||
|
|
||||||
// Refit + notify server on resize
|
|
||||||
new ResizeObserver(function () {
|
|
||||||
if (!_termFit) return;
|
|
||||||
try { _termFit.fit(); } catch (_) {}
|
|
||||||
if (_termWs && _termWs.readyState === 1 && _terminal) {
|
|
||||||
_termWs.send(JSON.stringify({ type: 'resize', cols: _terminal.cols, rows: _terminal.rows }));
|
|
||||||
}
|
|
||||||
}).observe(panel);
|
|
||||||
|
|
||||||
_termConnect();
|
|
||||||
}
|
|
||||||
|
|
||||||
function _termConnect() {
|
|
||||||
if (_termWs && _termWs.readyState <= 1) return; // already open or connecting
|
|
||||||
|
|
||||||
var proto = location.protocol === 'https:' ? 'wss:' : 'ws:';
|
|
||||||
var ws = new WebSocket(proto + '//' + location.host + '/ws/terminal');
|
|
||||||
ws.binaryType = 'arraybuffer';
|
|
||||||
_termWs = ws;
|
|
||||||
|
|
||||||
ws.onopen = function () {
|
|
||||||
_termHideReconnect();
|
|
||||||
if (_terminal && ws.readyState === 1) {
|
|
||||||
ws.send(JSON.stringify({ type: 'resize', cols: _terminal.cols, rows: _terminal.rows }));
|
|
||||||
}
|
|
||||||
};
|
|
||||||
|
|
||||||
ws.onmessage = function (e) {
|
|
||||||
if (!_terminal) return;
|
|
||||||
_terminal.write(e.data instanceof ArrayBuffer ? new Uint8Array(e.data) : e.data);
|
|
||||||
};
|
|
||||||
|
|
||||||
ws.onclose = function () {
|
|
||||||
if (_terminal) _terminal.write('\r\n\x1b[33m\u2500\u2500 disconnected \u2500\u2500\x1b[0m\r\n');
|
|
||||||
_termShowReconnect();
|
|
||||||
};
|
|
||||||
|
|
||||||
ws.onerror = function () { /* onclose fires too */ };
|
|
||||||
}
|
|
||||||
|
|
||||||
function _termShowReconnect() {
|
|
||||||
var panel = document.getElementById('drawer-panel-terminal');
|
|
||||||
if (!panel || panel.querySelector('.term-reconnect-bar')) return;
|
|
||||||
var bar = document.createElement('div');
|
|
||||||
bar.className = 'term-reconnect-bar';
|
|
||||||
bar.innerHTML = '<span>Connection closed</span>'
|
|
||||||
+ '<button class="btn-secondary">\u21ba Reconnect</button>';
|
|
||||||
bar.querySelector('button').onclick = function () {
|
|
||||||
bar.remove();
|
|
||||||
_termConnect();
|
|
||||||
};
|
|
||||||
panel.appendChild(bar);
|
|
||||||
}
|
|
||||||
|
|
||||||
function _termHideReconnect() {
|
|
||||||
var bar = document.querySelector('.term-reconnect-bar');
|
|
||||||
if (bar) bar.remove();
|
|
||||||
}
|
|
||||||
|
|
||||||
}());
|
}());
|
||||||
|
|
|
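Note on the unlock modal added to app.js above: the confirmation rule is easier to check when restated as two pure functions. The following is a minimal Python sketch of that rule, not code from this commit. The function names `expected_unlock_token` and `unlock_request_is_valid` are hypothetical; the token strings, the boot-before-exported precedence, and the 5-character minimum for the reason are taken from `openUnlockModal` and `validateUnlockModal`.

```python
def expected_unlock_token(pool_name: str, is_boot: bool, is_exported: bool) -> str:
    """Return the phrase the operator must type (mirrors openUnlockModal)."""
    if is_boot:
        # Boot pool takes precedence over the exported check, as in the JS.
        return "DESTROY BOOT POOL"
    if is_exported:
        return "DESTROY EXPORTED POOL"
    # Ordinary pool member: the operator types the pool name itself.
    return pool_name


def unlock_request_is_valid(operator: str, reason: str, token: str, expected: str) -> bool:
    """Mirror validateUnlockModal: operator set, reason >= 5 chars, exact token."""
    return (
        len(operator.strip()) > 0
        and len(reason.strip()) >= 5
        and token.strip() == expected  # case-sensitive, whitespace-trimmed
    )


if __name__ == "__main__":
    expected = expected_unlock_token("tank", is_boot=False, is_exported=False)
    print(unlock_request_is_valid("alice", "replacing failed disk", "tank", expected))
```

Whether the server applies the same rule to `confirm_token` is not visible in this part of the diff; the sketch only restates what the browser enforces before enabling the Unlock button.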
||||||
|
|
@ -80,11 +80,14 @@
|
||||||
{%- set bi_active = drive.burnin and drive.burnin.state in ('queued', 'running') %}
|
{%- set bi_active = drive.burnin and drive.burnin.state in ('queued', 'running') %}
|
||||||
{%- set short_busy = drive.smart_short and drive.smart_short.state == 'running' %}
|
{%- set short_busy = drive.smart_short and drive.smart_short.state == 'running' %}
|
||||||
{%- set long_busy = drive.smart_long and drive.smart_long.state == 'running' %}
|
{%- set long_busy = drive.smart_long and drive.smart_long.state == 'running' %}
|
||||||
{%- set selectable = not bi_active and not short_busy and not long_busy %}
|
{%- set pool_locked = drive.pool_name and not drive.pool_unlocked_until %}
|
||||||
|
{%- set is_boot_pool = drive.pool_name == 'boot-pool' %}
|
||||||
|
{%- set is_exported = drive.pool_role == 'exported' %}
|
||||||
|
{%- set selectable = not bi_active and not short_busy and not long_busy and not pool_locked %}
|
||||||
{%- set bi_done = drive.burnin and drive.burnin.state in ('passed', 'failed', 'cancelled', 'unknown') %}
|
{%- set bi_done = drive.burnin and drive.burnin.state in ('passed', 'failed', 'cancelled', 'unknown') %}
|
||||||
{%- set smart_done = (drive.smart_short and drive.smart_short.state in ('passed','failed','aborted'))
|
{%- set smart_done = (drive.smart_short and drive.smart_short.state in ('passed','failed','aborted'))
|
||||||
or (drive.smart_long and drive.smart_long.state in ('passed','failed','aborted')) %}
|
or (drive.smart_long and drive.smart_long.state in ('passed','failed','aborted')) %}
|
||||||
{%- set can_reset = (bi_done or smart_done) and not bi_active and not short_busy and not long_busy %}
|
{%- set can_reset = (bi_done or smart_done) and not bi_active and not short_busy and not long_busy and not pool_locked %}
|
||||||
<tr data-status="{{ drive.status }}" id="drive-{{ drive.id }}">
|
<tr data-status="{{ drive.status }}" id="drive-{{ drive.id }}">
|
||||||
<td class="col-check">
|
<td class="col-check">
|
||||||
{%- if selectable %}
|
{%- if selectable %}
|
||||||
|
|
@ -92,8 +95,18 @@
|
||||||
{%- endif %}
|
{%- endif %}
|
||||||
</td>
|
</td>
|
||||||
<td class="col-drive">
|
<td class="col-drive">
|
||||||
<span class="drive-name">{{ drive.devname }}</span>
|
<span class="drive-name">
|
||||||
|
{%- if drive.pool_name -%}
|
||||||
|
<span class="pool-lock-icon{% if is_boot_pool %} pool-lock-boot{% elif is_exported %} pool-lock-exported{% endif %}"
|
||||||
|
title="{% if is_boot_pool %}In BOOT POOL '{{ drive.pool_name }}'{% elif is_exported %}Carries ZFS data from a previously-imported pool{% else %}In pool '{{ drive.pool_name }}' ({{ drive.pool_role or 'data' }}){% endif %}">🔒</span>
|
||||||
|
{%- endif -%}
|
||||||
|
{{ drive.devname }}
|
||||||
|
</span>
|
||||||
<span class="drive-model">{{ drive.model or "Unknown" }}</span>
|
<span class="drive-model">{{ drive.model or "Unknown" }}</span>
|
||||||
|
{%- if drive.pool_name %}
|
||||||
|
<span class="pool-pill{% if is_boot_pool %} pool-pill-boot{% elif is_exported %} pool-pill-exported{% endif %}"
|
||||||
|
title="ZFS pool membership">{% if is_exported %}exported ZFS{% else %}{{ drive.pool_name }} · {{ drive.pool_role or 'data' }}{% endif %}</span>
|
||||||
|
{%- endif %}
|
||||||
{%- if drive.location %}
|
{%- if drive.location %}
|
||||||
<span class="drive-location"
|
<span class="drive-location"
|
||||||
data-drive-id="{{ drive.id }}"
|
data-drive-id="{{ drive.id }}"
|
||||||
|
|
@ -154,6 +167,20 @@
|
||||||
{% if short_busy %}disabled{% endif %}
|
{% if short_busy %}disabled{% endif %}
|
||||||
title="Start Long SMART test (~several hours)">Long</button>
|
title="Start Long SMART test (~several hours)">Long</button>
|
||||||
{%- endif %}
|
{%- endif %}
|
||||||
|
{%- if pool_locked %}
|
||||||
|
<!-- Drive is in a zpool — replace Burn-In with Unlock affordance -->
|
||||||
|
<button class="btn-action btn-unlock{% if is_boot_pool %} btn-unlock-boot{% elif is_exported %} btn-unlock-exported{% endif %}"
|
||||||
|
data-drive-id="{{ drive.id }}"
|
||||||
|
data-devname="{{ drive.devname }}"
|
||||||
|
data-serial="{{ drive.serial or '' }}"
|
||||||
|
data-model="{{ drive.model or 'Unknown' }}"
|
||||||
|
data-size="{{ drive.size_bytes | format_bytes }}"
|
||||||
|
data-pool-name="{{ drive.pool_name }}"
|
||||||
|
data-pool-role="{{ drive.pool_role or 'data' }}"
|
||||||
|
data-is-boot-pool="{{ '1' if is_boot_pool else '0' }}"
|
||||||
|
data-is-exported="{{ '1' if is_exported else '0' }}"
|
||||||
|
title="{% if is_boot_pool %}Drive is in BOOT POOL '{{ drive.pool_name }}' — click to unlock{% elif is_exported %}Drive carries ZFS data from a previously-imported pool — click to unlock{% else %}Drive is in pool '{{ drive.pool_name }}' — click to unlock{% endif %}">🔒 Unlock</button>
|
||||||
|
{%- else %}
|
||||||
<!-- Burn-In -->
|
<!-- Burn-In -->
|
||||||
<button class="btn-action btn-start{% if short_busy or long_busy %} btn-disabled{% endif %}"
|
<button class="btn-action btn-start{% if short_busy or long_busy %} btn-disabled{% endif %}"
|
||||||
data-drive-id="{{ drive.id }}"
|
data-drive-id="{{ drive.id }}"
|
||||||
|
|
@ -162,8 +189,10 @@
|
||||||
data-model="{{ drive.model or 'Unknown' }}"
|
data-model="{{ drive.model or 'Unknown' }}"
|
||||||
data-size="{{ drive.size_bytes | format_bytes }}"
|
data-size="{{ drive.size_bytes | format_bytes }}"
|
||||||
data-health="{{ drive.smart_health }}"
|
data-health="{{ drive.smart_health }}"
|
||||||
|
data-pool-name="{{ drive.pool_name or '' }}"
|
||||||
|
data-pool-unlocked-until="{{ drive.pool_unlocked_until or '' }}"
|
||||||
{% if short_busy or long_busy %}disabled{% endif %}
|
{% if short_busy or long_busy %}disabled{% endif %}
|
||||||
title="Start Burn-In">Burn-In</button>
|
title="Start Burn-In{% if drive.pool_name %} (UNLOCKED — pool drive){% endif %}">Burn-In{% if drive.pool_name %} <span class="unlock-countdown" data-expires="{{ drive.pool_unlocked_until }}">🔓</span>{% endif %}</button>
|
||||||
<!-- Reset — clears SMART state so drive can be re-tested from scratch -->
|
<!-- Reset — clears SMART state so drive can be re-tested from scratch -->
|
||||||
{%- if can_reset %}
|
{%- if can_reset %}
|
||||||
<button class="btn-action btn-reset"
|
<button class="btn-action btn-reset"
|
||||||
|
|
@ -171,6 +200,7 @@
|
||||||
title="Reset SMART state — clears test results so drive shows as fresh">Reset</button>
|
title="Reset SMART state — clears test results so drive shows as fresh">Reset</button>
|
||||||
{%- endif %}
|
{%- endif %}
|
||||||
{%- endif %}
|
{%- endif %}
|
||||||
|
{%- endif %}
|
||||||
</div>
|
</div>
|
||||||
</td>
|
</td>
|
||||||
</tr>
|
</tr>
|
||||||
|
|
|
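Note on the drive-row template changes above: the Jinja `set` expressions combine several per-drive states into the flags that decide whether a row shows a checkbox, an Unlock button, or a Reset button. The sketch below restates those expressions in Python so the interaction is easier to follow. It is an illustration under assumptions, not code from this commit: the function name `row_flags` and the dict-shaped `drive` argument are hypothetical, while the state names and the boolean logic follow the template.

```python
def row_flags(drive: dict) -> dict:
    """Recompute the template's per-row flags from a drive record."""
    burnin = drive.get("burnin") or {}
    short = drive.get("smart_short") or {}
    long_ = drive.get("smart_long") or {}

    bi_active = burnin.get("state") in ("queued", "running")
    short_busy = short.get("state") == "running"
    long_busy = long_.get("state") == "running"
    idle = not (bi_active or short_busy or long_busy)

    # Locked: the drive has a pool name and no unlock window is recorded.
    pool_locked = bool(drive.get("pool_name")) and not drive.get("pool_unlocked_until")

    bi_done = burnin.get("state") in ("passed", "failed", "cancelled", "unknown")
    smart_states = ("passed", "failed", "aborted")
    smart_done = short.get("state") in smart_states or long_.get("state") in smart_states

    return {
        "pool_locked": pool_locked,
        "is_boot_pool": drive.get("pool_name") == "boot-pool",
        "is_exported": drive.get("pool_role") == "exported",
        "selectable": idle and not pool_locked,
        "can_reset": (bi_done or smart_done) and idle and not pool_locked,
    }


if __name__ == "__main__":
    print(row_flags({"pool_name": "tank", "pool_role": "data", "pool_unlocked_until": None}))
```

As in the template, the flag only tests whether `pool_unlocked_until` is set; comparing it against the current time is assumed to happen elsewhere.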
||||||
69
app/templates/components/modal_unlock.html
Normal file
|
|
@ -0,0 +1,69 @@
|
||||||
|
<div id="unlock-modal" class="modal-overlay" hidden aria-modal="true" role="dialog">
|
||||||
|
<div class="modal modal-danger">
|
||||||
|
<div class="modal-header">
|
||||||
|
<h2 class="modal-title" id="unlock-modal-title">Unlock pool drive</h2>
|
||||||
|
<button class="modal-close" id="unlock-modal-close-btn" aria-label="Close">✕</button>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<div class="modal-body">
|
||||||
|
|
||||||
|
<div class="modal-drive-info">
|
||||||
|
<div class="modal-drive-row">
|
||||||
|
<span class="modal-devname" id="unlock-devname">—</span>
|
||||||
|
<span class="chip" id="unlock-pool-chip">—</span>
|
||||||
|
</div>
|
||||||
|
<div class="modal-drive-sub">
|
||||||
|
<span id="unlock-model">—</span>
|
||||||
|
·
|
||||||
|
<span id="unlock-size">—</span>
|
||||||
|
·
|
||||||
|
<span class="mono" id="unlock-serial">—</span>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<div id="unlock-warning" class="confirm-warning">
|
||||||
|
<strong id="unlock-warning-title">This drive belongs to a zpool.</strong>
|
||||||
|
<p id="unlock-warning-body">
|
||||||
|
Running a destructive burn-in stage will overwrite all data on this drive
|
||||||
|
and almost certainly destroy the pool. Only proceed if you have already
|
||||||
|
removed this drive from the pool, or if you are intentionally
|
||||||
|
decommissioning the pool.
|
||||||
|
</p>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<div class="form-group">
|
||||||
|
<label class="form-label" for="unlock-operator-input">Operator</label>
|
||||||
|
<input class="form-input" type="text" id="unlock-operator-input"
|
||||||
|
placeholder="Your name" autocomplete="name" maxlength="64">
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<div class="form-group">
|
||||||
|
<label class="form-label" for="unlock-reason-input">
|
||||||
|
Reason (recorded to audit log, minimum 5 characters)
|
||||||
|
</label>
|
||||||
|
<input class="form-input" type="text" id="unlock-reason-input"
|
||||||
|
placeholder="e.g. replacing failed drive in tank/raidz2-0"
|
||||||
|
autocomplete="off" maxlength="200">
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<div class="form-group">
|
||||||
|
<label class="form-label" for="unlock-confirm-input" id="unlock-confirm-label">
|
||||||
|
Type <code id="unlock-confirm-token">—</code> to confirm
|
||||||
|
</label>
|
||||||
|
<input class="form-input form-input-confirm" type="text" id="unlock-confirm-input"
|
||||||
|
placeholder="" autocomplete="off" spellcheck="false">
|
||||||
|
<div class="confirm-hint" id="unlock-confirm-hint"></div>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<div class="stage-always-note">
|
||||||
|
Unlock lasts 10 minutes. After that, this drive locks again automatically.
|
||||||
|
</div>
|
||||||
|
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<div class="modal-footer">
|
||||||
|
<button class="btn-secondary" id="unlock-modal-cancel-btn">Cancel</button>
|
||||||
|
<button class="btn-danger" id="unlock-modal-submit-btn" disabled>Unlock</button>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
|
@ -5,8 +5,9 @@
|
||||||
{% block content %}
|
{% block content %}
|
||||||
{% include "components/modal_start.html" %}
|
{% include "components/modal_start.html" %}
|
||||||
{% include "components/modal_batch.html" %}
|
{% include "components/modal_batch.html" %}
|
||||||
|
{% include "components/modal_unlock.html" %}
|
||||||
|
|
||||||
<!-- Stats bar — counts are updated live by app.js updateCounts() -->
|
<!-- Stats bar — drive counts updated live by app.js updateCounts(); sensor chips updated by SSE system-sensors event -->
|
||||||
<div class="stats-bar">
|
<div class="stats-bar">
|
||||||
<div class="stat-card" data-stat-filter="all">
|
<div class="stat-card" data-stat-filter="all">
|
||||||
<span class="stat-value" id="stat-all">{{ drives | length }}</span>
|
<span class="stat-value" id="stat-all">{{ drives | length }}</span>
|
||||||
|
|
@ -28,6 +29,33 @@
|
||||||
<span class="stat-value" id="stat-idle">0</span>
|
<span class="stat-value" id="stat-idle">0</span>
|
||||||
<span class="stat-label">Idle</span>
|
<span class="stat-label">Idle</span>
|
||||||
</div>
|
</div>
|
||||||
|
|
||||||
|
{%- set st = poller.system_temps if (poller and poller.system_temps) else {} %}
|
||||||
|
{%- if st.get('cpu_c') is not none or st.get('pch_c') is not none %}
|
||||||
|
<div class="stats-bar-sep"></div>
|
||||||
|
{%- if st.get('cpu_c') is not none %}
|
||||||
|
<div class="stat-sensor" id="sensor-cpu">
|
||||||
|
<span class="stat-sensor-val {{ st.get('cpu_c') | temp_class }}" id="sensor-cpu-val">{{ st.get('cpu_c') }}°</span>
|
||||||
|
<span class="stat-sensor-label">CPU</span>
|
||||||
|
</div>
|
||||||
|
{%- endif %}
|
||||||
|
{%- if st.get('pch_c') is not none %}
|
||||||
|
<div class="stat-sensor" id="sensor-pch">
|
||||||
|
<span class="stat-sensor-val {{ st.get('pch_c') | temp_class }}" id="sensor-pch-val">{{ st.get('pch_c') }}°</span>
|
||||||
|
<span class="stat-sensor-label">PCH</span>
|
||||||
|
</div>
|
||||||
|
{%- endif %}
|
||||||
|
{%- endif %}
|
||||||
|
|
||||||
|
{%- set tp = poller.thermal_pressure if poller else 'ok' %}
|
||||||
|
<div class="stat-sensor stat-sensor-thermal stat-sensor-thermal-{{ tp }}"
|
||||||
|
id="sensor-thermal"
|
||||||
|
{% if not tp or tp == 'ok' %}hidden{% endif %}>
|
||||||
|
<span class="stat-sensor-val" id="sensor-thermal-val">
|
||||||
|
{%- if tp == 'warn' %}WARM{%- elif tp == 'crit' %}HOT{%- else %}OK{%- endif %}
|
||||||
|
</span>
|
||||||
|
<span class="stat-sensor-label">Thermal</span>
|
||||||
|
</div>
|
||||||
</div>
|
</div>
|
||||||
|
|
||||||
<!-- Failed drive banner — shown/hidden by JS when failed count > 0 -->
|
<!-- Failed drive banner — shown/hidden by JS when failed count > 0 -->
|
||||||
|
|
@ -83,7 +111,6 @@
|
||||||
<button class="drawer-tab active" data-tab="burnin">Burn-In</button>
|
<button class="drawer-tab active" data-tab="burnin">Burn-In</button>
|
||||||
<button class="drawer-tab" data-tab="smart">SMART</button>
|
<button class="drawer-tab" data-tab="smart">SMART</button>
|
||||||
<button class="drawer-tab" data-tab="events">Events</button>
|
<button class="drawer-tab" data-tab="events">Events</button>
|
||||||
<button class="drawer-tab" data-tab="terminal">Terminal</button>
|
|
||||||
</nav>
|
</nav>
|
||||||
<div class="drawer-controls">
|
<div class="drawer-controls">
|
||||||
<label class="autoscroll-label">
|
<label class="autoscroll-label">
|
||||||
|
|
@ -97,7 +124,6 @@
|
||||||
<div class="drawer-panel active" id="drawer-panel-burnin"></div>
|
<div class="drawer-panel active" id="drawer-panel-burnin"></div>
|
||||||
<div class="drawer-panel" id="drawer-panel-smart"></div>
|
<div class="drawer-panel" id="drawer-panel-smart"></div>
|
||||||
<div class="drawer-panel" id="drawer-panel-events"></div>
|
<div class="drawer-panel" id="drawer-panel-events"></div>
|
||||||
<div class="drawer-panel drawer-panel-terminal" id="drawer-panel-terminal"></div>
|
|
||||||
</div>
|
</div>
|
||||||
</div>
|
</div>
|
||||||
{% endblock %}
|
{% endblock %}
|
||||||
|
|
|
||||||
|
|
@ -248,6 +248,30 @@
|
||||||
type="number" min="0" max="9999" value="{{ editable.bad_block_threshold }}">
|
type="number" min="0" max="9999" value="{{ editable.bad_block_threshold }}">
|
||||||
<span class="sf-hint">Max bad blocks before surface validate fails (Stage 7)</span>
|
<span class="sf-hint">Max bad blocks before surface validate fails (Stage 7)</span>
|
||||||
</div>
|
</div>
|
||||||
|
|
||||||
|
<div class="sf-row">
|
||||||
|
<label class="sf-label" for="surface_validate_block_size">Badblocks Block Size (bytes)</label>
|
||||||
|
<input class="sf-input sf-input-xs" id="surface_validate_block_size"
|
||||||
|
name="surface_validate_block_size" type="number" min="512" max="1048576" step="512"
|
||||||
|
value="{{ editable.surface_validate_block_size }}">
|
||||||
|
<span class="sf-hint">badblocks -b. 4096 (default) is conservative; 8192 is faster on multi-TB HDDs (~2x RAM, ~half the runtime). Power of 2.</span>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<div class="sf-row">
|
||||||
|
<label class="sf-label" for="surface_validate_block_buffer">Badblocks Block Buffer</label>
|
||||||
|
<input class="sf-input sf-input-xs" id="surface_validate_block_buffer"
|
||||||
|
name="surface_validate_block_buffer" type="number" min="1" max="4096"
|
||||||
|
value="{{ editable.surface_validate_block_buffer }}">
|
||||||
|
<span class="sf-hint">badblocks -c. 64 (default) matches the upstream tool. Buffer = block_size × this many blocks per IO.</span>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<div class="sf-row">
|
||||||
|
<label class="sf-label" for="surface_validate_passes">Badblocks Passes</label>
|
||||||
|
<input class="sf-input sf-input-xs" id="surface_validate_passes"
|
||||||
|
name="surface_validate_passes" type="number" min="0" max="16"
|
||||||
|
value="{{ editable.surface_validate_passes }}">
|
||||||
|
<span class="sf-hint">badblocks -p. 1 = repeat until one consecutive clean scan (default). 2-3 for paranoid burn-in that re-confirms after errors.</span>
|
||||||
|
</div>
|
||||||
</div>
|
</div>
|
||||||
|
|
||||||
</div><!-- /right col -->
|
</div><!-- /right col -->
|
||||||
|
|
|
||||||
0
tests/__init__.py
Normal file
283
tests/test_pool_parser.py
Normal file
|
|
@@ -0,0 +1,283 @@
|
||||||
|
"""Unit tests for the zpool-list and lsblk parsers in ssh_client.
|
||||||
|
|
||||||
|
These cover the structural cases that drive the pool-membership lock:
|
||||||
|
mirror/raidz/draid container vdevs, single-disk vdevs at depth 1, the
|
||||||
|
flattened-indentation behaviour of `zpool list -vHP` on TrueNAS, partition
|
||||||
|
suffix stripping for NVMe and SCSI, and the cache/log/spare/special
|
||||||
|
section markers (including plural variants).
|
||||||
|
|
||||||
|
Run with: python -m unittest discover tests/ -v
|
||||||
|
"""
|
||||||
|
|
||||||
|
import unittest
|
||||||
|
|
||||||
|
from app.ssh_client import (
|
||||||
|
_parse_zpool_list_output,
|
||||||
|
_parse_lsblk_zfs_output,
|
||||||
|
_parse_smart_health_batch,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
class TestParseZpoolList(unittest.TestCase):
|
||||||
|
|
||||||
|
def test_empty_output_returns_empty(self):
|
||||||
|
self.assertEqual(_parse_zpool_list_output(""), {})
|
||||||
|
|
||||||
|
def test_single_pool_with_mirror(self):
|
||||||
|
# TrueNAS-flattened output: pool at depth 0, vdev type and devices
|
||||||
|
# all at depth 1.
|
||||||
|
out = _parse_zpool_list_output(
|
||||||
|
"boot-pool\t232G\t8.4G\t224G\t-\t-\t17%\t3%\t1.00x\tONLINE\t-\n"
|
||||||
|
"\tmirror-0\t232G\t8.4G\t224G\t-\t-\t17%\t3.6%\t-\tONLINE\n"
|
||||||
|
"\t/dev/nvme0n1p3\t232G\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
|
||||||
|
"\t/dev/sdd3\t232G\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
|
||||||
|
)
|
||||||
|
self.assertEqual(out, {
|
||||||
|
"nvme0n1": {"pool": "boot-pool", "role": "data"},
|
||||||
|
"sdd": {"pool": "boot-pool", "role": "data"},
|
||||||
|
})
|
||||||
|
|
||||||
|
def test_raidz2_pool(self):
|
||||||
|
out = _parse_zpool_list_output(
|
||||||
|
"tank\t127T\t4.5T\t122T\t-\t-\t0%\t3%\t1.00x\tONLINE\t-\n"
|
||||||
|
"\traidz2-0\t127T\t4.5T\t122T\t-\t-\t0%\t3%\t-\tONLINE\n"
|
||||||
|
"\t/dev/sdc\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
|
||||||
|
"\t/dev/sde\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
|
||||||
|
"\t/dev/sdf\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
|
||||||
|
)
|
||||||
|
self.assertEqual(set(out.keys()), {"sdc", "sde", "sdf"})
|
||||||
|
for v in out.values():
|
||||||
|
self.assertEqual(v, {"pool": "tank", "role": "data"})
|
||||||
|
|
||||||
|
def test_draid_pool(self):
|
||||||
|
out = _parse_zpool_list_output(
|
||||||
|
"warm\t100T\t-\t-\t-\t-\t-\t-\t-\tONLINE\t-\n"
|
||||||
|
"\tdraid2:8d:10c:1s-0\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
|
||||||
|
"\t/dev/sdg\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
|
||||||
|
"\t/dev/sdh\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
|
||||||
|
)
|
||||||
|
self.assertEqual(out["sdg"], {"pool": "warm", "role": "data"})
|
||||||
|
self.assertEqual(out["sdh"], {"pool": "warm", "role": "data"})
|
||||||
|
|
||||||
|
def test_single_disk_vdev_at_depth_1(self):
|
||||||
|
# No mirror/raidz wrapper — a `/dev/...` line itself sits at depth 1.
|
||||||
|
out = _parse_zpool_list_output(
|
||||||
|
"scratch\t1T\t-\t-\t-\t-\t-\t-\t-\tONLINE\t-\n"
|
||||||
|
"\t/dev/sdi\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
|
||||||
|
)
|
||||||
|
self.assertEqual(out, {"sdi": {"pool": "scratch", "role": "data"}})
|
||||||
|
|
||||||
|
def test_section_markers_switch_role(self):
|
||||||
|
# cache / log / spare / special / dedup all at depth 1; subsequent
|
||||||
|
# /dev/... lines (also at depth 1) inherit that role.
|
||||||
|
out = _parse_zpool_list_output(
|
||||||
|
"tank\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\t-\n"
|
||||||
|
"\tmirror-0\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
|
||||||
|
"\t/dev/sda\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
|
||||||
|
"\t/dev/sdb\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
|
||||||
|
"\tcache\n"
|
||||||
|
"\t/dev/nvme1n1\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
|
||||||
|
"\tlog\n"
|
||||||
|
"\t/dev/nvme2n1\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
|
||||||
|
"\tspare\n"
|
||||||
|
"\t/dev/sdz\t-\t-\t-\t-\t-\t-\t-\t-\tAVAIL\n"
|
||||||
|
)
|
||||||
|
self.assertEqual(out["sda"], {"pool": "tank", "role": "data"})
|
||||||
|
self.assertEqual(out["sdb"], {"pool": "tank", "role": "data"})
|
||||||
|
self.assertEqual(out["nvme1n1"], {"pool": "tank", "role": "cache"})
|
||||||
|
self.assertEqual(out["nvme2n1"], {"pool": "tank", "role": "log"})
|
||||||
|
self.assertEqual(out["sdz"], {"pool": "tank", "role": "spare"})
|
||||||
|
|
||||||
|
def test_section_markers_plurals_normalize(self):
|
||||||
|
# ZFS sometimes emits 'logs'/'spares' instead of 'log'/'spare'.
|
||||||
|
out = _parse_zpool_list_output(
|
||||||
|
"tank\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\t-\n"
|
||||||
|
"\tlogs\n"
|
||||||
|
"\t/dev/nvme0n1\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
|
||||||
|
"\tspares\n"
|
||||||
|
"\t/dev/sdz\t-\t-\t-\t-\t-\t-\t-\t-\tAVAIL\n"
|
||||||
|
)
|
||||||
|
self.assertEqual(out["nvme0n1"]["role"], "log")
|
||||||
|
self.assertEqual(out["sdz"]["role"], "spare")
|
||||||
|
|
||||||
|
def test_special_and_dedup_section(self):
|
||||||
|
out = _parse_zpool_list_output(
|
||||||
|
"tank\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\t-\n"
|
||||||
|
"\tspecial\n"
|
||||||
|
"\t/dev/sda\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
|
||||||
|
"\tdedup\n"
|
||||||
|
"\t/dev/sdb\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
|
||||||
|
)
|
||||||
|
self.assertEqual(out["sda"]["role"], "special")
|
||||||
|
self.assertEqual(out["sdb"]["role"], "dedup")
|
||||||
|
|
||||||
|
def test_partition_suffix_stripped(self):
|
||||||
|
out = _parse_zpool_list_output(
|
||||||
|
"tank\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\t-\n"
|
||||||
|
"\tmirror-0\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
|
||||||
|
"\t/dev/sda3\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
|
||||||
|
"\t/dev/nvme0n1p3\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
|
||||||
|
)
|
||||||
|
self.assertIn("sda", out)
|
||||||
|
self.assertNotIn("sda3", out)
|
||||||
|
self.assertIn("nvme0n1", out)
|
||||||
|
self.assertNotIn("nvme0n1p3", out)
|
||||||
|
|
||||||
|
def test_long_scsi_devname(self):
|
||||||
|
# Past sdz: sdaa, sdab, ...
|
||||||
|
out = _parse_zpool_list_output(
|
||||||
|
"big\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\t-\n"
|
||||||
|
"\traidz3-0\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
|
||||||
|
"\t/dev/sdaa\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
|
||||||
|
"\t/dev/sdab1\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
|
||||||
|
)
|
||||||
|
self.assertEqual(out["sdaa"]["pool"], "big")
|
||||||
|
self.assertEqual(out["sdab"]["pool"], "big") # partition stripped
|
||||||
|
|
||||||
|
def test_pool_name_with_dashes_dots_underscores(self):
|
||||||
|
out = _parse_zpool_list_output(
|
||||||
|
"my-cool_pool.v2\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\t-\n"
|
||||||
|
"\t/dev/sda\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
|
||||||
|
)
|
||||||
|
self.assertEqual(out["sda"]["pool"], "my-cool_pool.v2")
|
||||||
|
|
||||||
|
def test_multiple_pools(self):
|
||||||
|
out = _parse_zpool_list_output(
|
||||||
|
"boot-pool\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\t-\n"
|
||||||
|
"\tmirror-0\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
|
||||||
|
"\t/dev/nvme0n1p3\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
|
||||||
|
"\t/dev/sdd3\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
|
||||||
|
"tank\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\t-\n"
|
||||||
|
"\traidz2-0\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
|
||||||
|
"\t/dev/sda\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
|
||||||
|
"\t/dev/sdb\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
|
||||||
|
)
|
||||||
|
self.assertEqual(out["nvme0n1"]["pool"], "boot-pool")
|
||||||
|
self.assertEqual(out["sdd"]["pool"], "boot-pool")
|
||||||
|
self.assertEqual(out["sda"]["pool"], "tank")
|
||||||
|
self.assertEqual(out["sdb"]["pool"], "tank")
|
||||||
|
|
||||||
|
def test_pool_role_resets_between_pools(self):
|
||||||
|
# Section marker in pool A must not carry into pool B.
|
||||||
|
out = _parse_zpool_list_output(
|
||||||
|
"a\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\t-\n"
|
||||||
|
"\tcache\n"
|
||||||
|
"\t/dev/sda\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
|
||||||
|
"b\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\t-\n"
|
||||||
|
"\t/dev/sdb\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
|
||||||
|
)
|
||||||
|
self.assertEqual(out["sda"]["role"], "cache")
|
||||||
|
self.assertEqual(out["sdb"]["role"], "data")
|
||||||
|
|
||||||
|
def test_blank_lines_skipped(self):
|
||||||
|
out = _parse_zpool_list_output(
|
||||||
|
"\n"
|
||||||
|
"tank\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\t-\n"
|
||||||
|
"\n"
|
||||||
|
"\t/dev/sda\t-\t-\t-\t-\t-\t-\t-\t-\tONLINE\n"
|
||||||
|
)
|
||||||
|
self.assertEqual(out, {"sda": {"pool": "tank", "role": "data"}})
|
||||||
|
|
||||||
|
|
||||||
|
class TestParseLsblkZfs(unittest.TestCase):
|
||||||
|
|
||||||
|
def test_empty_returns_empty_set(self):
|
||||||
|
self.assertEqual(_parse_lsblk_zfs_output(""), set())
|
||||||
|
|
||||||
|
def test_partition_zfs_member(self):
|
||||||
|
# Typical TrueNAS layout: zpool members are partitions.
|
||||||
|
out = _parse_lsblk_zfs_output(
|
||||||
|
"sda \n"
|
||||||
|
"sda1 \n"
|
||||||
|
"sda3 zfs_member\n"
|
||||||
|
"sdb \n"
|
||||||
|
"sdb3 zfs_member\n"
|
||||||
|
)
|
||||||
|
self.assertEqual(out, {"sda", "sdb"})
|
||||||
|
|
||||||
|
def test_whole_disk_zfs_member(self):
|
||||||
|
# Some configurations put zfs_member on the whole disk.
|
||||||
|
out = _parse_lsblk_zfs_output(
|
||||||
|
"sdc zfs_member\n"
|
||||||
|
)
|
||||||
|
self.assertEqual(out, {"sdc"})
|
||||||
|
|
||||||
|
def test_nvme_partitioned_and_whole(self):
|
||||||
|
out = _parse_lsblk_zfs_output(
|
||||||
|
"nvme0n1 \n"
|
||||||
|
"nvme0n1p3 zfs_member\n"
|
||||||
|
"nvme1n1 zfs_member\n"
|
||||||
|
)
|
||||||
|
self.assertEqual(out, {"nvme0n1", "nvme1n1"})
|
||||||
|
|
||||||
|
def test_non_zfs_fstypes_ignored(self):
|
||||||
|
out = _parse_lsblk_zfs_output(
|
||||||
|
"sda1 ext4\n"
|
||||||
|
"sda2 swap\n"
|
||||||
|
"sdb1 btrfs\n"
|
||||||
|
)
|
||||||
|
self.assertEqual(out, set())
|
||||||
|
|
||||||
|
def test_long_scsi_devnames(self):
|
||||||
|
out = _parse_lsblk_zfs_output(
|
||||||
|
"sdaa zfs_member\n"
|
||||||
|
"sdab1 zfs_member\n"
|
||||||
|
)
|
||||||
|
self.assertEqual(out, {"sdaa", "sdab"})
|
||||||
|
|
||||||
|
def test_short_lines_skipped(self):
|
||||||
|
out = _parse_lsblk_zfs_output(
|
||||||
|
"sda\n"
|
||||||
|
"\n"
|
||||||
|
"sdb1 zfs_member\n"
|
||||||
|
)
|
||||||
|
self.assertEqual(out, {"sdb"})
|
||||||
|
|
||||||
|
|
||||||
|
class TestParseSmartHealthBatch(unittest.TestCase):
|
||||||
|
|
||||||
|
def test_passed_drive(self):
|
||||||
|
out = _parse_smart_health_batch(
|
||||||
|
"@@sda@@\n"
|
||||||
|
"smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6]\n"
|
||||||
|
"SMART overall-health self-assessment test result: PASSED\n"
|
||||||
|
"@@END@@\n"
|
||||||
|
)
|
||||||
|
self.assertEqual(out, {"sda": "PASSED"})
|
||||||
|
|
||||||
|
def test_failed_drive(self):
|
||||||
|
out = _parse_smart_health_batch(
|
||||||
|
"@@sdb@@\n"
|
||||||
|
"SMART overall-health self-assessment test result: FAILED!\n"
|
||||||
|
"@@END@@\n"
|
||||||
|
)
|
||||||
|
self.assertEqual(out, {"sdb": "FAILED"})
|
||||||
|
|
||||||
|
def test_unknown_when_no_marker(self):
|
||||||
|
out = _parse_smart_health_batch(
|
||||||
|
"@@sdc@@\n"
|
||||||
|
"/dev/sdc: Unknown USB bridge\n"
|
||||||
|
"@@END@@\n"
|
||||||
|
)
|
||||||
|
self.assertEqual(out, {"sdc": "UNKNOWN"})
|
||||||
|
|
||||||
|
def test_multiple_drives_mixed_states(self):
|
||||||
|
out = _parse_smart_health_batch(
|
||||||
|
"@@sda@@\n"
|
||||||
|
"SMART overall-health self-assessment test result: PASSED\n"
|
||||||
|
"@@END@@\n"
|
||||||
|
"@@sdb@@\n"
|
||||||
|
"SMART overall-health self-assessment test result: FAILED!\n"
|
||||||
|
"@@END@@\n"
|
||||||
|
"@@nvme0n1@@\n"
|
||||||
|
"SMART overall-health self-assessment test result: PASSED\n"
|
||||||
|
"@@END@@\n"
|
||||||
|
)
|
||||||
|
self.assertEqual(out, {"sda": "PASSED", "sdb": "FAILED", "nvme0n1": "PASSED"})
|
||||||
|
|
||||||
|
def test_empty_returns_empty(self):
|
||||||
|
self.assertEqual(_parse_smart_health_batch(""), {})
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
unittest.main()
|
||||||
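Reviewer sketch (not part of the commit): the parser these tests exercise is not shown in this part of the diff. As a rough illustration only, the behaviour pinned down by TestParseZpoolList could be met by something like the following. The names `parse_zpool_list`, `strip_partition` and `_SECTION_ROLES` are hypothetical and are not the real `_parse_zpool_list_output` in app/ssh_client.py; the sketch assumes tab-separated `zpool list -vHP` output with the pool at depth 0 and everything else at depth 1, as the test fixtures show.

```python
import re

# Section markers switch the role of the devices that follow them.
# Plural spellings are normalised to the singular role name.
_SECTION_ROLES = {
    "cache": "cache",
    "log": "log", "logs": "log",
    "spare": "spare", "spares": "spare",
    "special": "special",
    "dedup": "dedup",
}


def strip_partition(dev: str) -> str:
    """Reduce /dev/sda3 to sda and /dev/nvme0n1p3 to nvme0n1."""
    name = dev.rsplit("/", 1)[-1]
    m = re.fullmatch(r"(nvme\d+n\d+)(?:p\d+)?", name)
    if m:
        return m.group(1)
    m = re.fullmatch(r"([a-z]+)\d*", name)
    return m.group(1) if m else name


def parse_zpool_list(text: str) -> dict:
    """Map base device name -> {"pool": ..., "role": ...}."""
    result = {}
    pool = None
    role = "data"
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines carry no information
        fields = line.split("\t")
        if not line.startswith("\t"):
            # Depth 0: a new pool starts, so the role resets to data.
            pool, role = fields[0], "data"
            continue
        name = fields[1]
        if name in _SECTION_ROLES:
            role = _SECTION_ROLES[name]
        elif name.startswith("/dev/") and pool is not None:
            result[strip_partition(name)] = {"pool": pool, "role": role}
        # Anything else (mirror-0, raidz2-0, draid...) is a container vdev
        # and is skipped; its children appear as their own /dev/ lines.
    return result
```

Resetting `role` on every depth-0 line is what keeps a `cache` marker in one pool from leaking into the next, which is the case test_pool_role_resets_between_pools guards.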
303
tests/test_unlock_flow.py
Normal file
|
|
@@ -0,0 +1,303 @@
|
||||||
|
"""Unit tests for the pool-drive unlock state machine in burnin.py.
|
||||||
|
|
||||||
|
Covers: token validation per pool kind, identity-binding (grant
|
||||||
|
invalidated when pool_name/pool_role changes), TTL expiry, the
|
||||||
|
audit-commit-then-arm ordering (a failing audit insert leaves no
|
||||||
|
in-memory grant), and the unique-active-burnin partial index that
|
||||||
|
prevents duplicate queued rows for the same drive.
|
||||||
|
|
||||||
|
Uses an in-memory SQLite DB and monkeypatches app.config.settings.db_path.
|
||||||
|
No SSH, no network, no FastAPI.
|
||||||
|
|
||||||
|
Run with: python -m unittest discover tests/ -v
|
||||||
|
"""
|
||||||
|
|
||||||
|
import os
|
||||||
|
import tempfile
|
||||||
|
import time
|
||||||
|
import unittest
|
||||||
|
|
||||||
|
import aiosqlite
|
||||||
|
|
||||||
|
|
||||||
|
async def _setup_temp_db() -> str:
|
||||||
|
"""Create a temp SQLite file, point app.config at it, init schema.
|
||||||
|
Async-callable from IsolatedAsyncioTestCase.asyncSetUp."""
|
||||||
|
fd, path = tempfile.mkstemp(suffix=".db")
|
||||||
|
os.close(fd)
|
||||||
|
from app.config import settings
|
||||||
|
settings.db_path = path
|
||||||
|
|
||||||
|
from app.database import init_db
|
||||||
|
await init_db()
|
||||||
|
# Seed pool drives so unlock_flow tests have something to grant on.
|
||||||
|
async with aiosqlite.connect(path) as db:
|
||||||
|
await db.execute("""
|
||||||
|
INSERT INTO drives
|
||||||
|
(truenas_disk_id, devname, serial, model, size_bytes,
|
||||||
|
temperature_c, smart_health, last_seen_at, last_polled_at,
|
||||||
|
pool_name, pool_role, pool_seen_at)
|
||||||
|
VALUES ('test-id-1', 'sda', 'TESTSER1', 'TestModel', 1000,
|
||||||
|
30, 'PASSED', '2026-05-02T00:00:00+00:00',
|
||||||
|
'2026-05-02T00:00:00+00:00',
|
||||||
|
'tank', 'data', '2026-05-02T00:00:00+00:00')
|
||||||
|
""")
|
||||||
|
await db.execute("""
|
||||||
|
INSERT INTO drives
|
||||||
|
(truenas_disk_id, devname, serial, model, size_bytes,
|
||||||
|
temperature_c, smart_health, last_seen_at, last_polled_at,
|
||||||
|
pool_name, pool_role, pool_seen_at)
|
||||||
|
VALUES ('test-id-2', 'sdb', 'TESTSER2', 'TestModel', 1000,
|
||||||
|
30, 'PASSED', '2026-05-02T00:00:00+00:00',
|
||||||
|
'2026-05-02T00:00:00+00:00',
|
||||||
|
'boot-pool', 'data', '2026-05-02T00:00:00+00:00')
|
||||||
|
""")
|
||||||
|
await db.execute("""
|
||||||
|
INSERT INTO drives
|
||||||
|
(truenas_disk_id, devname, serial, model, size_bytes,
|
||||||
|
temperature_c, smart_health, last_seen_at, last_polled_at,
|
||||||
|
pool_name, pool_role, pool_seen_at)
|
||||||
|
VALUES ('test-id-3', 'sdc', 'TESTSER3', 'TestModel', 1000,
|
||||||
|
30, 'PASSED', '2026-05-02T00:00:00+00:00',
|
||||||
|
'2026-05-02T00:00:00+00:00',
|
||||||
|
'(exported)', 'exported', '2026-05-02T00:00:00+00:00')
|
||||||
|
""")
|
||||||
|
await db.commit()
|
||||||
|
return path
|
||||||
|
|
||||||
|
|
||||||
|
class TestUnlockFlow(unittest.IsolatedAsyncioTestCase):
|
||||||
|
|
||||||
|
async def asyncSetUp(self):
|
||||||
|
self.db_path = await _setup_temp_db()
|
||||||
|
# Reset module state so previous test runs don't bleed in.
|
||||||
|
from app import burnin
|
||||||
|
burnin._unlock_grants.clear()
|
||||||
|
|
||||||
|
async def asyncTearDown(self):
|
||||||
|
try:
|
||||||
|
os.unlink(self.db_path)
|
||||||
|
except OSError:
|
||||||
|
pass
|
||||||
|
|
||||||
|
# ----- token validation per pool kind -----
|
||||||
|
|
||||||
|
async def test_active_pool_token_is_pool_name(self):
|
||||||
|
from app import burnin
|
||||||
|
# Drive 1 = tank/data
|
||||||
|
with self.assertRaises(ValueError):
|
||||||
|
await burnin.grant_pool_unlock(1, "wrong", "op", "valid reason")
|
||||||
|
expiry = await burnin.grant_pool_unlock(1, "tank", "op", "valid reason")
|
||||||
|
self.assertGreater(expiry, time.time())
|
||||||
|
|
||||||
|
async def test_boot_pool_token_is_destroy_phrase(self):
|
||||||
|
from app import burnin
|
||||||
|
# Drive 2 = boot-pool — typing the pool name must NOT work.
|
||||||
|
with self.assertRaises(ValueError):
|
||||||
|
await burnin.grant_pool_unlock(2, "boot-pool", "op", "valid reason")
|
||||||
|
expiry = await burnin.grant_pool_unlock(
|
||||||
|
2, "DESTROY BOOT POOL", "op", "valid reason"
|
||||||
|
)
|
||||||
|
self.assertGreater(expiry, time.time())
|
||||||
|
|
||||||
|
async def test_exported_token_is_destroy_phrase(self):
|
||||||
|
from app import burnin
|
||||||
|
# Drive 3 = (exported)/exported
|
||||||
|
with self.assertRaises(ValueError):
|
||||||
|
await burnin.grant_pool_unlock(3, "(exported)", "op", "valid reason")
|
||||||
|
expiry = await burnin.grant_pool_unlock(
|
||||||
|
3, "DESTROY EXPORTED POOL", "op", "valid reason"
|
||||||
|
)
|
||||||
|
self.assertGreater(expiry, time.time())
|
||||||
|
|
||||||
|
# ----- input validation -----
|
||||||
|
|
||||||
|
async def test_empty_reason_rejected(self):
|
||||||
|
from app import burnin
|
||||||
|
with self.assertRaises(ValueError):
|
||||||
|
await burnin.grant_pool_unlock(1, "tank", "op", "")
|
||||||
|
|
||||||
|
async def test_short_reason_rejected(self):
|
||||||
|
from app import burnin
|
||||||
|
with self.assertRaises(ValueError):
|
||||||
|
await burnin.grant_pool_unlock(1, "tank", "op", "hi")
|
||||||
|
|
||||||
|
async def test_empty_operator_rejected(self):
|
||||||
|
from app import burnin
|
||||||
|
with self.assertRaises(ValueError):
|
||||||
|
await burnin.grant_pool_unlock(1, "tank", "", "valid reason")
|
||||||
|
|
||||||
|
async def test_unknown_drive_rejected(self):
|
||||||
|
from app import burnin
|
||||||
|
with self.assertRaises(ValueError):
|
||||||
|
await burnin.grant_pool_unlock(99999, "anything", "op", "valid reason")
|
||||||
|
|
||||||
|
async def test_drive_not_in_pool_rejected(self):
|
||||||
|
from app import burnin
|
||||||
|
# Manually clear pool fields on drive 1
|
||||||
|
async with aiosqlite.connect(self.db_path) as db:
|
||||||
|
await db.execute("UPDATE drives SET pool_name=NULL, pool_role=NULL WHERE id=1")
|
||||||
|
await db.commit()
|
||||||
|
with self.assertRaises(ValueError):
|
||||||
|
await burnin.grant_pool_unlock(1, "tank", "op", "valid reason")
|
||||||
|
|
||||||
|
# ----- identity binding (Codex finding #2) -----
|
||||||
|
|
||||||
|
async def test_grant_invalidated_when_pool_name_changes(self):
|
||||||
|
from app import burnin
|
||||||
|
await burnin.grant_pool_unlock(1, "tank", "op", "valid reason")
|
||||||
|
# Operator's grant references tank/data; pool detection now reports tank2.
|
||||||
|
self.assertTrue(burnin._is_unlocked(1, "tank", "data"))
|
||||||
|
self.assertFalse(burnin._is_unlocked(1, "tank2", "data"))
|
||||||
|
# And the side effect: the grant is reaped, not just temporarily denied.
|
||||||
|
self.assertNotIn(1, burnin._unlock_grants)
|
||||||
|
|
||||||
|
async def test_grant_invalidated_when_pool_role_changes(self):
|
||||||
|
from app import burnin
|
||||||
|
await burnin.grant_pool_unlock(1, "tank", "op", "valid reason")
|
||||||
|
# Same pool, different role (data -> cache).
|
||||||
|
self.assertFalse(burnin._is_unlocked(1, "tank", "cache"))
|
||||||
|
self.assertNotIn(1, burnin._unlock_grants)
|
||||||
|
|
||||||
|
async def test_unlock_expiry_returns_none_for_mismatched_identity(self):
|
||||||
|
from app import burnin
|
||||||
|
await burnin.grant_pool_unlock(1, "tank", "op", "valid reason")
|
||||||
|
self.assertIsNotNone(burnin.unlock_expiry(1, "tank", "data"))
|
||||||
|
self.assertIsNone(burnin.unlock_expiry(1, "tank2", "data"))
|
||||||
|
|
||||||
|
# ----- TTL expiry -----
|
||||||
|
|
||||||
|
async def test_expired_grant_returns_false(self):
|
||||||
|
from app import burnin
|
||||||
|
# Drop TTL to 0 so the grant is born expired.
|
||||||
|
original = burnin.UNLOCK_TTL_SECONDS
|
||||||
|
burnin.UNLOCK_TTL_SECONDS = 0
|
||||||
|
try:
|
||||||
|
await burnin.grant_pool_unlock(1, "tank", "op", "valid reason")
|
||||||
|
self.assertFalse(burnin._is_unlocked(1, "tank", "data"))
|
||||||
|
self.assertNotIn(1, burnin._unlock_grants)
|
||||||
|
finally:
|
||||||
|
burnin.UNLOCK_TTL_SECONDS = original
|
||||||
|
|
||||||
|
# ----- audit commit ordering (Codex finding #3) -----
|
||||||
|
|
||||||
|
async def test_audit_event_recorded_for_active_pool(self):
|
||||||
|
from app import burnin
|
||||||
|
await burnin.grant_pool_unlock(1, "tank", "alice", "swapping out drive")
|
||||||
|
async with aiosqlite.connect(self.db_path) as db:
|
||||||
|
db.row_factory = aiosqlite.Row
|
||||||
|
cur = await db.execute(
|
||||||
|
"SELECT event_type, operator, message FROM audit_events "
|
||||||
|
"WHERE drive_id=? ORDER BY id DESC LIMIT 1", (1,)
|
||||||
|
)
|
||||||
|
row = await cur.fetchone()
|
||||||
|
self.assertEqual(row["event_type"], "pool_drive_unlocked")
|
||||||
|
self.assertEqual(row["operator"], "alice")
|
||||||
|
self.assertIn("swapping out drive", row["message"])
|
||||||
|
|
||||||
|
async def test_audit_event_for_boot_pool_uses_distinct_type(self):
|
||||||
|
from app import burnin
|
||||||
|
await burnin.grant_pool_unlock(
|
||||||
|
2, "DESTROY BOOT POOL", "alice", "replacing failed mirror"
|
||||||
|
)
|
||||||
|
async with aiosqlite.connect(self.db_path) as db:
|
||||||
|
db.row_factory = aiosqlite.Row
|
||||||
|
cur = await db.execute(
|
||||||
|
"SELECT event_type FROM audit_events WHERE drive_id=? ORDER BY id DESC LIMIT 1",
|
||||||
|
(2,),
|
||||||
|
)
|
||||||
|
row = await cur.fetchone()
|
||||||
|
self.assertEqual(row["event_type"], "boot_pool_drive_unlocked")
|
||||||
|
|
||||||
|
async def test_audit_event_for_exported_uses_distinct_type(self):
|
||||||
|
from app import burnin
|
||||||
|
await burnin.grant_pool_unlock(
|
||||||
|
3, "DESTROY EXPORTED POOL", "alice", "decommissioned pool"
|
||||||
|
)
|
||||||
|
async with aiosqlite.connect(self.db_path) as db:
|
||||||
|
db.row_factory = aiosqlite.Row
|
||||||
|
cur = await db.execute(
|
||||||
|
"SELECT event_type FROM audit_events WHERE drive_id=? ORDER BY id DESC LIMIT 1",
|
||||||
|
(3,),
|
||||||
|
)
|
||||||
|
row = await cur.fetchone()
|
||||||
|
self.assertEqual(row["event_type"], "exported_pool_drive_unlocked")
|
||||||
|
|
||||||
|
async def test_failed_token_does_not_record_audit_event(self):
|
||||||
|
from app import burnin
|
||||||
|
try:
|
||||||
|
await burnin.grant_pool_unlock(1, "wrong-token", "op", "valid reason")
|
||||||
|
except ValueError:
|
||||||
|
pass
|
||||||
|
async with aiosqlite.connect(self.db_path) as db:
|
||||||
|
cur = await db.execute(
|
||||||
|
"SELECT COUNT(*) FROM audit_events WHERE drive_id=?", (1,)
|
||||||
|
)
|
||||||
|
self.assertEqual((await cur.fetchone())[0], 0)
|
||||||
|
# And no in-memory grant was armed.
|
||||||
|
self.assertNotIn(1, burnin._unlock_grants)
|
||||||
|
|
||||||
|
|
||||||
|
class TestActiveJobUniqueIndex(unittest.IsolatedAsyncioTestCase):
|
||||||
|
"""Codex finding #4 — the partial unique index on burnin_jobs(drive_id)
|
||||||
|
WHERE state IN ('queued','running') must reject a second active row even
|
||||||
|
when two requests pass the SELECT-COUNT check concurrently."""
|
||||||
|
|
||||||
|
async def asyncSetUp(self):
|
||||||
|
self.db_path = await _setup_temp_db()
|
||||||
|
from app import burnin
|
||||||
|
burnin._unlock_grants.clear()
|
||||||
|
# Need to clear the pool field on drive 1 so unlock isn't required
|
||||||
|
# for these race tests.
|
||||||
|
async with aiosqlite.connect(self.db_path) as db:
|
||||||
|
await db.execute("UPDATE drives SET pool_name=NULL, pool_role=NULL WHERE id=1")
|
||||||
|
await db.commit()
|
||||||
|
# Burnin orchestrator init for the semaphore
|
||||||
|
from app import burnin as b
|
||||||
|
import asyncio as _a
|
||||||
|
b._semaphore = _a.Semaphore(4)
|
||||||
|
|
||||||
|
async def asyncTearDown(self):
|
||||||
|
try:
|
||||||
|
os.unlink(self.db_path)
|
||||||
|
except OSError:
|
||||||
|
pass
|
||||||
|
|
||||||
|
async def test_index_blocks_second_active_insert(self):
|
||||||
|
# Insert a queued row by hand, then try a second one — index fires.
|
||||||
|
async with aiosqlite.connect(self.db_path) as db:
|
||||||
|
await db.execute(
|
||||||
|
"""INSERT INTO burnin_jobs (drive_id, profile, state, percent, operator, created_at)
|
||||||
|
VALUES (?,?,?,?,?,?)""",
|
||||||
|
(1, "surface", "queued", 0, "op", "2026-05-02T00:00:00+00:00"),
|
||||||
|
)
|
||||||
|
await db.commit()
|
||||||
|
with self.assertRaises(aiosqlite.IntegrityError):
|
||||||
|
await db.execute(
|
||||||
|
"""INSERT INTO burnin_jobs (drive_id, profile, state, percent, operator, created_at)
|
||||||
|
VALUES (?,?,?,?,?,?)""",
|
||||||
|
(1, "surface", "queued", 0, "op", "2026-05-02T00:00:01+00:00"),
|
||||||
|
)
|
||||||
|
await db.commit()
|
||||||
|
|
||||||
|
async def test_index_allows_terminal_state_then_new_job(self):
|
||||||
|
# passed/failed/cancelled/unknown rows must not block a fresh queue.
|
||||||
|
async with aiosqlite.connect(self.db_path) as db:
|
||||||
|
for state in ("passed", "failed", "cancelled", "unknown"):
|
||||||
|
await db.execute(
|
||||||
|
"""INSERT INTO burnin_jobs (drive_id, profile, state, percent, operator, created_at)
|
||||||
|
VALUES (?,?,?,?,?,?)""",
|
||||||
|
(1, "surface", state, 100, "op", "2026-05-02T00:00:00+00:00"),
|
||||||
|
)
|
||||||
|
await db.commit()
|
||||||
|
# Should succeed — no other queued/running row exists.
|
||||||
|
await db.execute(
|
||||||
|
"""INSERT INTO burnin_jobs (drive_id, profile, state, percent, operator, created_at)
|
||||||
|
VALUES (?,?,?,?,?,?)""",
|
||||||
|
(1, "surface", "queued", 0, "op", "2026-05-02T00:00:00+00:00"),
|
||||||
|
)
|
||||||
|
await db.commit()
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
unittest.main()
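Reviewer sketch (not part of the commit): the identity-binding and TTL tests above imply a grant that is tied to the (pool_name, pool_role) it was issued for and is reaped on the first failed check. A minimal stand-alone model of that idea, assuming nothing about the real burnin module beyond what the tests assert, might look like this. `Grant`, `grants`, `arm` and `is_unlocked` are hypothetical names, and the explicit `now` parameter exists only to make the example deterministic.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Grant:
    pool_name: str
    pool_role: str
    expires_at: float


# drive_id -> Grant; an in-memory table, lost on restart by design.
grants: dict[int, Grant] = {}


def arm(drive_id: int, pool_name: str, pool_role: str,
        ttl: float, now: Optional[float] = None) -> float:
    """Record a grant bound to the drive's current pool identity."""
    now = time.time() if now is None else now
    grants[drive_id] = Grant(pool_name, pool_role, now + ttl)
    return grants[drive_id].expires_at


def is_unlocked(drive_id: int, pool_name: str, pool_role: str,
                now: Optional[float] = None) -> bool:
    """True only if a live grant matches the identity seen right now."""
    now = time.time() if now is None else now
    grant = grants.get(drive_id)
    if grant is None:
        return False
    stale = (grant.pool_name, grant.pool_role) != (pool_name, pool_role)
    expired = now >= grant.expires_at
    if stale or expired:
        # Reap rather than merely deny, so a later check with the old
        # identity cannot revive the grant.
        grants.pop(drive_id, None)
        return False
    return True
```

Using `>=` for the expiry comparison means a grant armed with a TTL of 0 is already expired at the moment it is created, matching test_expired_grant_returns_false.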
|
||||||
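Reviewer sketch (not part of the commit): TestActiveJobUniqueIndex relies on a partial unique index over burnin_jobs(drive_id) restricted to the queued and running states. The effect can be reproduced in isolation with the standard-library sqlite3 module. The table below is cut down to three columns and the index name `idx_one_active_job` is made up for the example; the real schema lives in app/database.py and is not shown here.

```python
import sqlite3


def demo_partial_unique_index() -> bool:
    """Return True if a second active job for the same drive is rejected."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE burnin_jobs ("
        " id INTEGER PRIMARY KEY,"
        " drive_id INTEGER NOT NULL,"
        " state TEXT NOT NULL)"
    )
    # Uniqueness is enforced only over rows matching the WHERE clause,
    # so terminal rows (passed, failed, ...) never collide.
    db.execute(
        "CREATE UNIQUE INDEX idx_one_active_job"
        " ON burnin_jobs(drive_id)"
        " WHERE state IN ('queued', 'running')"
    )
    db.execute("INSERT INTO burnin_jobs (drive_id, state) VALUES (1, 'passed')")
    db.execute("INSERT INTO burnin_jobs (drive_id, state) VALUES (1, 'queued')")
    try:
        db.execute(
            "INSERT INTO burnin_jobs (drive_id, state) VALUES (1, 'running')"
        )
    except sqlite3.IntegrityError:
        return True
    finally:
        db.close()
    return False
```

Because the constraint is checked by the database at insert time, it holds even when two requests both pass an application-level SELECT COUNT check before either has inserted.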