nas-burnin/tests/test_unlock_flow.py
Brandon Walter 5da1a1704f feat: pool-membership lock + cancellation hardening + smart_health refresh + tunables (1.0.0-13 -> 1.0.0-21)
Substantial feature + reliability sweep. Each version below was developed,
tested live against the maple/TrueNAS deployment, and Codex-reviewed
before bundling.

1.0.0-13 — asyncssh proc.kill() doesn't actually kill the remote process
  (sshd ignores SSH signal-channel requests by default), so a cancel of a
  long-running badblocks left the remote process running and proc.wait()
  hanging — pinning the asyncio.Semaphore slot forever.

  * Wrap long-lived commands in `sh -c 'echo PID:$$; exec <cmd>'` to
    capture the remote PID; store in burnin._remote_pids[job_id].
  * burnin._kill_remote_process(job_id) opens a fresh SSH session and
    issues `kill -9 <pid>` — sshd honours that.
  * Bound proc.wait() with asyncio.wait_for(timeout=15).
  * burnin._active_tasks tracks every _run_job task so cancel_job and
    check_stuck_jobs can actually cancel the asyncio task (was DB-only
    before). Holding these strong references also fixes the documented
    asyncio.create_task GC gotcha (the event loop keeps only weak
    references to tasks).
  * _run_job finalizer reads current state and skips the write if state
    != 'running' so cancelled/unknown aren't clobbered.
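
The PID-capture wrapper described above can be sketched as follows. This is a minimal illustration, not the project's code: `wrap_with_pid` and `parse_pid_line` are hypothetical helper names, and only the `sh -c 'echo PID:$$; exec <cmd>'` shape comes from the commit text.

```python
import re
import shlex
from typing import Optional


def wrap_with_pid(cmd: str) -> str:
    """Wrap cmd so the remote shell prints its own PID, then execs cmd.

    Because exec replaces the shell process, the printed PID is also
    the PID of the long-running command.
    """
    return "sh -c " + shlex.quote(f"echo PID:$$; exec {cmd}")


_PID_LINE = re.compile(r"^PID:(\d+)$")


def parse_pid_line(line: str) -> Optional[int]:
    """Return the PID from a 'PID:<n>' line, or None for any other line."""
    match = _PID_LINE.match(line.strip())
    return int(match.group(1)) if match else None
```

The first line of remote output would be fed to `parse_pid_line`; every later line is ordinary command output and yields `None`.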

1.0.0-14 — poller._upsert_drive ON CONFLICT only refreshed temperature/
  health/poll timestamps; devname/serial/model/size_bytes were stuck at
  first-INSERT values forever. After kernel SCSI re-enumeration two
  drives could both show as `sda`. Fixed by updating all six fields.
  Also added 7-day stale filter to _DRIVES_QUERY so removed drives drop
  off the dashboard while audit/burnin_jobs FKs stay intact.
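
A hedged sketch of the upsert fix, using the stdlib `sqlite3` module and a simplified table. The conflict key (`truenas_disk_id`) and the reduced column list are assumptions for illustration; the real `_upsert_drive` statement is not shown in this commit message.

```python
import sqlite3

# Assumed, simplified statement: refresh identity columns on conflict
# instead of leaving them at their first-INSERT values.
UPSERT = """
INSERT INTO drives (truenas_disk_id, devname, serial, model, size_bytes)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT(truenas_disk_id) DO UPDATE SET
    devname = excluded.devname,
    serial = excluded.serial,
    model = excluded.model,
    size_bytes = excluded.size_bytes
"""


def devname_after_reenumeration() -> str:
    """Upsert the same disk twice with a changed devname; return the stored one."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE drives (truenas_disk_id TEXT PRIMARY KEY, devname TEXT,"
        " serial TEXT, model TEXT, size_bytes INTEGER)"
    )
    db.execute(UPSERT, ("disk-1", "sda", "SER1", "ModelA", 1000))
    # Same disk, re-enumerated by the kernel under a new name.
    db.execute(UPSERT, ("disk-1", "sdb", "SER1", "ModelA", 1000))
    row = db.execute(
        "SELECT devname FROM drives WHERE truenas_disk_id = 'disk-1'"
    ).fetchone()
    db.close()
    return row[0]
```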

1.0.0-15/-16 — pool-membership lock.
  * ssh_client.get_pool_membership() runs `zpool list -vHP` and parses
    the flattened TrueNAS output (container vdevs + their device children
    both appear at depth 1; section markers cache/log/spare/special/dedup
    switch the role).
  * ssh_client.get_zfs_member_drives() runs `lsblk -no NAME,FSTYPE -l`
    to detect drives carrying ZFS labels not in any active pool — they
    get pool_name='(exported)', pool_role='exported'.
  * Three idempotent ALTER TABLE migrations on drives:
    pool_name/pool_role/pool_seen_at.
  * burnin.start_job raises PoolMemberError if pool_name IS NOT NULL and
    the drive isn't in burnin._unlock_grants. Routes layer maps to 409
    with structured detail {pool_name, pool_role, pool_locked: true} so
    the frontend can render an unlock affordance.
  * POST /api/v1/drives/{id}/unlock accepts {confirm_token, operator,
    reason}. Token is the pool name for active pools, "DESTROY BOOT POOL"
    for boot-pool, "DESTROY EXPORTED POOL" for exported. Reason >= 5
    chars. TTL = UNLOCK_TTL_SECONDS = 600. Audit event types:
    pool_drive_unlocked / boot_pool_drive_unlocked /
    exported_pool_drive_unlocked.
  * Grants are in-memory only — container restart wipes them.
  * UI: lock icon (yellow/red/orange), pool pill, conditional Unlock vs
    Burn-In button. modal_unlock.html with type-to-confirm field.
    Live unlock countdown via tickUnlockCountdowns() in app.js.
  * Daily report: red banner listing every unlock event from the last
    24h, with operator + reason + timestamp.
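
The confirm-token rule above can be written as a small lookup. This is a sketch only: `expected_confirm_token` is a hypothetical name, and it assumes the boot pool is identified by pool_name 'boot-pool' and exported drives by pool_role 'exported', as the seeded test rows suggest.

```python
def expected_confirm_token(pool_name: str, pool_role: str) -> str:
    """Return the phrase an operator must type to unlock a pool drive."""
    if pool_name == "boot-pool":
        return "DESTROY BOOT POOL"
    if pool_role == "exported":
        return "DESTROY EXPORTED POOL"
    # Active pools: the token is the pool name itself.
    return pool_name
```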

1.0.0-17 — Codex review fail-open + XSS + structured-error fixes.
  * ssh_client.get_pool_membership / get_zfs_member_drives now return
    None on failure (vs {} for 'definitely empty'). poller passes
    update_pool=False to _upsert_drive on detection failure, preserving
    existing pool columns instead of clearing them. Without this fix a
    1-second SSH blip silently unlocked every drive.
  * mailer._build_unlock_banner_html escapes every interpolated field
    via html.escape() (was '<' only). Time filter switched to
    julianday() — string >= against datetime('now', '-1 day') compared
    formats with different separators ('T' vs ' ') and timezone
    suffixes, causing subtle off-by-N-hour inclusion.
  * app.js submitStart/submitBatchStart now detect the structured
    pool_locked 409 detail and auto-open the unlock modal for the
    offending drive (was [object Object] in toast).
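
The time-filter problem can be reproduced with the stdlib `sqlite3` module. The sketch below compares an ISO timestamp using a 'T' separator against a cutoff in SQLite's space-separated format, first as plain strings and then via julianday(). The function name and sample values are illustrative.

```python
import sqlite3


def compare_both_ways(stored: str, cutoff: str) -> tuple:
    """Return (string_comparison, julianday_comparison) for stored >= cutoff."""
    db = sqlite3.connect(":memory:")
    row = db.execute(
        "SELECT ? >= ?, julianday(?) >= julianday(?)",
        (stored, cutoff, stored, cutoff),
    ).fetchone()
    db.close()
    return bool(row[0]), bool(row[1])


# Midnight UTC is 12 hours BEFORE the noon cutoff, yet the string
# comparison says it is later, because 'T' sorts after ' '.
result = compare_both_ways("2026-05-02T00:00:00+00:00", "2026-05-02 12:00:00")
```

Here `result` is `(True, False)`: the string comparison wrongly includes the row, while julianday() parses both formats and excludes it.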

1.0.0-18 — Codex grant-binding + commit-ordering fixes.
  * Unlock grants bound to the (pool_name, pool_role) observed at unlock
    time. _UnlockGrant dataclass; _is_unlocked and unlock_expiry
    invalidate the grant if the live row's pool identity has changed.
    Prevents an 'exported' unlock from carrying over when the drive
    turns out to be in active 'tank' or 'boot-pool'.
  * grant_pool_unlock now writes to _unlock_grants only AFTER db.commit()
    succeeds — previously a failed audit insert left an unaudited grant
    armed.
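
A minimal sketch of identity-bound grants, assuming a dataclass holding the pool identity and an expiry time. The names `UnlockGrant`, `grants` and `is_unlocked` are stand-ins; the real `_UnlockGrant` fields and `_is_unlocked` logic may differ.

```python
import time
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class UnlockGrant:
    pool_name: str
    pool_role: str
    expires_at: float  # epoch seconds


grants: Dict[int, UnlockGrant] = {}


def is_unlocked(drive_id: int, pool_name: str, pool_role: str) -> bool:
    """True only if a live grant matches the drive's current pool identity."""
    grant = grants.get(drive_id)
    if grant is None:
        return False
    same_identity = (grant.pool_name, grant.pool_role) == (pool_name, pool_role)
    if not same_identity or time.time() >= grant.expires_at:
        # Reap the grant so a later identity flip cannot revive it.
        grants.pop(drive_id, None)
        return False
    return True
```

Reaping on mismatch, rather than merely denying, matches the behaviour the tests in this file assert with `assertNotIn(1, burnin._unlock_grants)`.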

1.0.0-19 — Codex race + cancellation classification + test scaffold.
  * Partial unique index uniq_active_burnin_per_drive ON burnin_jobs
    (drive_id) WHERE state IN ('queued','running'). INSERT now wraps in
    try/except aiosqlite.IntegrityError -> ValueError so the read-then-
    insert race in start_job can't produce two queued rows for the same
    drive.
  * _run_job tracks was_cancelled flag; on bare task.cancel() (shutdown,
    future code paths) where DB state is still 'running', finalizer
    writes 'unknown' instead of mis-classifying as 'failed'.
  * tests/ stdlib unittest scaffold:
    - test_pool_parser.py (21 tests): mirror/raidz/draid container vdevs,
      single-disk depth-1, plural section markers, partition stripping,
      sdaa-style names, multi-pool, role reset between pools.
    - test_unlock_flow.py (18 tests): token validation per pool kind,
      identity-binding invalidation, TTL expiry, audit-commit-then-arm
      ordering, unique-active-burnin partial index.
    Run via `python -m unittest discover tests/`. No new dependencies.
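
The partial unique index can be demonstrated in isolation with the stdlib `sqlite3` module. The table below is a reduced stand-in for burnin_jobs (only the columns the index touches); the index definition follows the commit text.

```python
import sqlite3


def second_active_insert_is_rejected() -> bool:
    """Terminal rows coexist freely; a second queued/running row is refused."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE burnin_jobs (id INTEGER PRIMARY KEY, drive_id INTEGER, state TEXT)"
    )
    db.execute(
        "CREATE UNIQUE INDEX uniq_active_burnin_per_drive ON burnin_jobs (drive_id)"
        " WHERE state IN ('queued','running')"
    )
    # Terminal states are outside the index predicate, so duplicates are fine.
    db.execute("INSERT INTO burnin_jobs (drive_id, state) VALUES (1, 'passed')")
    db.execute("INSERT INTO burnin_jobs (drive_id, state) VALUES (1, 'failed')")
    db.execute("INSERT INTO burnin_jobs (drive_id, state) VALUES (1, 'queued')")
    try:
        db.execute("INSERT INTO burnin_jobs (drive_id, state) VALUES (1, 'running')")
    except sqlite3.IntegrityError:
        return True
    finally:
        db.close()
    return False
```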

1.0.0-20 — Spearfoot-inspired badblocks tunables.
  * surface_validate_block_size (-b, default 4096), surface_validate_
    block_buffer (-c, default 64), surface_validate_passes (-p, default
    1) exposed in Settings UI; persist via settings_store.json.
    Validation: block size must be a power of 2 between 512 and
    1048576. Defaults preserve existing behaviour. Bumping to 8192/64/1
    roughly halves runtime on multi-TB HDDs at ~2x RAM cost.
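
The block-size rule (power of 2 between 512 and 1048576) reduces to a one-line check. `is_valid_block_size` is a hypothetical name; the Settings validation code itself is not part of this commit message.

```python
def is_valid_block_size(n: int) -> bool:
    """True if n is a power of two in the inclusive range 512..1048576."""
    # A positive power of two has exactly one bit set, so n & (n - 1) == 0.
    return 512 <= n <= 1_048_576 and (n & (n - 1)) == 0
```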

1.0.0-21 — SMART overall-health column actually populated.
  * /api/v2.0/disk doesn't expose smart_health, so every drive defaulted
    to UNKNOWN forever (only burn-in stages ever wrote a real value).
  * ssh_client.get_smart_health_map([devnames]) runs `smartctl -H` for
    all drives in a single SSH session, deterministically delimited with
    @@devname@@ ... @@END@@ markers. Returns {devname: PASSED|FAILED|
    UNKNOWN} or None on SSH failure.
  * poller calls it every 5th cycle (~1 min at default 12s interval),
    caches in _state['smart_health_cache'] so transient failures preserve
    the previous values.
  * Dashboard CSS: col-smart min-width 150 -> 95, horizontal padding 14
    -> 6 so Short/Long SMART columns fit comfortably on a 13-inch
    display.
  * 5 additional parser tests (44 total, all passing).
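
A sketch of parsing the @@devname@@ ... @@END@@ delimited output into a health map. The exact marker layout and the PASSED/FAILED keyword matching are assumptions based on the description above; `parse_smart_health` is not the project's parser.

```python
import re
from typing import Dict

# One block per drive: a @@devname@@ line, free-form smartctl text, @@END@@.
_BLOCK = re.compile(r"@@(?P<dev>[^@\s]+)@@\n(?P<body>.*?)@@END@@", re.DOTALL)


def parse_smart_health(output: str) -> Dict[str, str]:
    """Map each devname to PASSED, FAILED or UNKNOWN."""
    result: Dict[str, str] = {}
    for match in _BLOCK.finditer(output):
        body = match.group("body")
        if "FAILED" in body:
            status = "FAILED"
        elif "PASSED" in body:
            status = "PASSED"
        else:
            status = "UNKNOWN"
        result[match.group("dev")] = status
    return result


SAMPLE = (
    "@@sda@@\nSMART overall-health self-assessment test result: PASSED\n@@END@@\n"
    "@@sdb@@\nSMART overall-health self-assessment test result: FAILED!\n@@END@@\n"
    "@@sdc@@\nUnable to detect device type\n@@END@@\n"
)
```

Explicit markers keep one drive's output from bleeding into the next when a single SSH session returns results for many devices.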

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 09:25:56 -04:00


"""Unit tests for the pool-drive unlock state machine in burnin.py.
Covers: token validation per pool kind, identity-binding (grant
invalidated when pool_name/pool_role changes), TTL expiry, the
audit-commit-then-arm ordering (a failing audit insert leaves no
in-memory grant), and the unique-active-burnin partial index that
prevents duplicate queued rows for the same drive.
Uses a temporary on-disk SQLite file and monkeypatches app.config.settings.db_path.
No SSH, no network, no FastAPI.
Run with: python -m unittest discover tests/ -v
"""
import os
import tempfile
import time
import unittest

import aiosqlite


async def _setup_temp_db() -> str:
    """Create a temp SQLite file, point app.config at it, init schema.
    Async-callable from IsolatedAsyncioTestCase.asyncSetUp."""
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    from app.config import settings
    settings.db_path = path
    from app.database import init_db
    await init_db()
    # Seed pool drives so unlock_flow tests have something to grant on.
    async with aiosqlite.connect(path) as db:
        await db.execute("""
            INSERT INTO drives
                (truenas_disk_id, devname, serial, model, size_bytes,
                 temperature_c, smart_health, last_seen_at, last_polled_at,
                 pool_name, pool_role, pool_seen_at)
            VALUES ('test-id-1', 'sda', 'TESTSER1', 'TestModel', 1000,
                    30, 'PASSED', '2026-05-02T00:00:00+00:00',
                    '2026-05-02T00:00:00+00:00',
                    'tank', 'data', '2026-05-02T00:00:00+00:00')
        """)
        await db.execute("""
            INSERT INTO drives
                (truenas_disk_id, devname, serial, model, size_bytes,
                 temperature_c, smart_health, last_seen_at, last_polled_at,
                 pool_name, pool_role, pool_seen_at)
            VALUES ('test-id-2', 'sdb', 'TESTSER2', 'TestModel', 1000,
                    30, 'PASSED', '2026-05-02T00:00:00+00:00',
                    '2026-05-02T00:00:00+00:00',
                    'boot-pool', 'data', '2026-05-02T00:00:00+00:00')
        """)
        await db.execute("""
            INSERT INTO drives
                (truenas_disk_id, devname, serial, model, size_bytes,
                 temperature_c, smart_health, last_seen_at, last_polled_at,
                 pool_name, pool_role, pool_seen_at)
            VALUES ('test-id-3', 'sdc', 'TESTSER3', 'TestModel', 1000,
                    30, 'PASSED', '2026-05-02T00:00:00+00:00',
                    '2026-05-02T00:00:00+00:00',
                    '(exported)', 'exported', '2026-05-02T00:00:00+00:00')
        """)
        await db.commit()
    return path


class TestUnlockFlow(unittest.IsolatedAsyncioTestCase):
    async def asyncSetUp(self):
        self.db_path = await _setup_temp_db()
        # Reset module state so previous test runs don't bleed in.
        from app import burnin
        burnin._unlock_grants.clear()

    async def asyncTearDown(self):
        try:
            os.unlink(self.db_path)
        except OSError:
            pass

    # ----- token validation per pool kind -----
    async def test_active_pool_token_is_pool_name(self):
        from app import burnin
        # Drive 1 = tank/data
        with self.assertRaises(ValueError):
            await burnin.grant_pool_unlock(1, "wrong", "op", "valid reason")
        expiry = await burnin.grant_pool_unlock(1, "tank", "op", "valid reason")
        self.assertGreater(expiry, time.time())

    async def test_boot_pool_token_is_destroy_phrase(self):
        from app import burnin
        # Drive 2 = boot-pool — typing the pool name must NOT work.
        with self.assertRaises(ValueError):
            await burnin.grant_pool_unlock(2, "boot-pool", "op", "valid reason")
        expiry = await burnin.grant_pool_unlock(
            2, "DESTROY BOOT POOL", "op", "valid reason"
        )
        self.assertGreater(expiry, time.time())

    async def test_exported_token_is_destroy_phrase(self):
        from app import burnin
        # Drive 3 = (exported)/exported
        with self.assertRaises(ValueError):
            await burnin.grant_pool_unlock(3, "(exported)", "op", "valid reason")
        expiry = await burnin.grant_pool_unlock(
            3, "DESTROY EXPORTED POOL", "op", "valid reason"
        )
        self.assertGreater(expiry, time.time())

    # ----- input validation -----
    async def test_empty_reason_rejected(self):
        from app import burnin
        with self.assertRaises(ValueError):
            await burnin.grant_pool_unlock(1, "tank", "op", "")

    async def test_short_reason_rejected(self):
        from app import burnin
        with self.assertRaises(ValueError):
            await burnin.grant_pool_unlock(1, "tank", "op", "hi")

    async def test_empty_operator_rejected(self):
        from app import burnin
        with self.assertRaises(ValueError):
            await burnin.grant_pool_unlock(1, "tank", "", "valid reason")

    async def test_unknown_drive_rejected(self):
        from app import burnin
        with self.assertRaises(ValueError):
            await burnin.grant_pool_unlock(99999, "anything", "op", "valid reason")

    async def test_drive_not_in_pool_rejected(self):
        from app import burnin
        # Manually clear pool fields on drive 1
        async with aiosqlite.connect(self.db_path) as db:
            await db.execute("UPDATE drives SET pool_name=NULL, pool_role=NULL WHERE id=1")
            await db.commit()
        with self.assertRaises(ValueError):
            await burnin.grant_pool_unlock(1, "tank", "op", "valid reason")

    # ----- identity binding (Codex finding #2) -----
    async def test_grant_invalidated_when_pool_name_changes(self):
        from app import burnin
        await burnin.grant_pool_unlock(1, "tank", "op", "valid reason")
        # Operator's grant references tank/data; pool detection now reports tank2.
        self.assertTrue(burnin._is_unlocked(1, "tank", "data"))
        self.assertFalse(burnin._is_unlocked(1, "tank2", "data"))
        # And the side effect: the grant is reaped, not just temporarily denied.
        self.assertNotIn(1, burnin._unlock_grants)

    async def test_grant_invalidated_when_pool_role_changes(self):
        from app import burnin
        await burnin.grant_pool_unlock(1, "tank", "op", "valid reason")
        # Same pool, different role (data -> cache).
        self.assertFalse(burnin._is_unlocked(1, "tank", "cache"))
        self.assertNotIn(1, burnin._unlock_grants)

    async def test_unlock_expiry_returns_none_for_mismatched_identity(self):
        from app import burnin
        await burnin.grant_pool_unlock(1, "tank", "op", "valid reason")
        self.assertIsNotNone(burnin.unlock_expiry(1, "tank", "data"))
        self.assertIsNone(burnin.unlock_expiry(1, "tank2", "data"))

    # ----- TTL expiry -----
    async def test_expired_grant_returns_false(self):
        from app import burnin
        # Drop TTL to 0 so the grant is born expired.
        original = burnin.UNLOCK_TTL_SECONDS
        burnin.UNLOCK_TTL_SECONDS = 0
        try:
            await burnin.grant_pool_unlock(1, "tank", "op", "valid reason")
            self.assertFalse(burnin._is_unlocked(1, "tank", "data"))
            self.assertNotIn(1, burnin._unlock_grants)
        finally:
            burnin.UNLOCK_TTL_SECONDS = original

    # ----- audit commit ordering (Codex finding #3) -----
    async def test_audit_event_recorded_for_active_pool(self):
        from app import burnin
        await burnin.grant_pool_unlock(1, "tank", "alice", "swapping out drive")
        async with aiosqlite.connect(self.db_path) as db:
            db.row_factory = aiosqlite.Row
            cur = await db.execute(
                "SELECT event_type, operator, message FROM audit_events "
                "WHERE drive_id=? ORDER BY id DESC LIMIT 1", (1,)
            )
            row = await cur.fetchone()
        self.assertEqual(row["event_type"], "pool_drive_unlocked")
        self.assertEqual(row["operator"], "alice")
        self.assertIn("swapping out drive", row["message"])

    async def test_audit_event_for_boot_pool_uses_distinct_type(self):
        from app import burnin
        await burnin.grant_pool_unlock(
            2, "DESTROY BOOT POOL", "alice", "replacing failed mirror"
        )
        async with aiosqlite.connect(self.db_path) as db:
            db.row_factory = aiosqlite.Row
            cur = await db.execute(
                "SELECT event_type FROM audit_events WHERE drive_id=? ORDER BY id DESC LIMIT 1",
                (2,),
            )
            row = await cur.fetchone()
        self.assertEqual(row["event_type"], "boot_pool_drive_unlocked")

    async def test_audit_event_for_exported_uses_distinct_type(self):
        from app import burnin
        await burnin.grant_pool_unlock(
            3, "DESTROY EXPORTED POOL", "alice", "decommissioned pool"
        )
        async with aiosqlite.connect(self.db_path) as db:
            db.row_factory = aiosqlite.Row
            cur = await db.execute(
                "SELECT event_type FROM audit_events WHERE drive_id=? ORDER BY id DESC LIMIT 1",
                (3,),
            )
            row = await cur.fetchone()
        self.assertEqual(row["event_type"], "exported_pool_drive_unlocked")

    async def test_failed_token_does_not_record_audit_event(self):
        from app import burnin
        try:
            await burnin.grant_pool_unlock(1, "wrong-token", "op", "valid reason")
        except ValueError:
            pass
        async with aiosqlite.connect(self.db_path) as db:
            cur = await db.execute(
                "SELECT COUNT(*) FROM audit_events WHERE drive_id=?", (1,)
            )
            self.assertEqual((await cur.fetchone())[0], 0)
        # And no in-memory grant was armed.
        self.assertNotIn(1, burnin._unlock_grants)


class TestActiveJobUniqueIndex(unittest.IsolatedAsyncioTestCase):
    """Codex finding #4 — the partial unique index on burnin_jobs(drive_id)
    WHERE state IN ('queued','running') must reject a second active row even
    when two requests pass the SELECT-COUNT check concurrently."""

    async def asyncSetUp(self):
        self.db_path = await _setup_temp_db()
        from app import burnin
        burnin._unlock_grants.clear()
        # Need to clear the pool field on drive 1 so unlock isn't required
        # for these race tests.
        async with aiosqlite.connect(self.db_path) as db:
            await db.execute("UPDATE drives SET pool_name=NULL, pool_role=NULL WHERE id=1")
            await db.commit()
        # Burnin orchestrator init for the semaphore
        from app import burnin as b
        import asyncio as _a
        b._semaphore = _a.Semaphore(4)

    async def asyncTearDown(self):
        try:
            os.unlink(self.db_path)
        except OSError:
            pass

    async def test_index_blocks_second_active_insert(self):
        # Insert a queued row by hand, then try a second one — index fires.
        async with aiosqlite.connect(self.db_path) as db:
            await db.execute(
                """INSERT INTO burnin_jobs (drive_id, profile, state, percent, operator, created_at)
                   VALUES (?,?,?,?,?,?)""",
                (1, "surface", "queued", 0, "op", "2026-05-02T00:00:00+00:00"),
            )
            await db.commit()
            with self.assertRaises(aiosqlite.IntegrityError):
                await db.execute(
                    """INSERT INTO burnin_jobs (drive_id, profile, state, percent, operator, created_at)
                       VALUES (?,?,?,?,?,?)""",
                    (1, "surface", "queued", 0, "op", "2026-05-02T00:00:01+00:00"),
                )
                await db.commit()

    async def test_index_allows_terminal_state_then_new_job(self):
        # passed/failed/cancelled/unknown rows must not block a fresh queue.
        async with aiosqlite.connect(self.db_path) as db:
            for state in ("passed", "failed", "cancelled", "unknown"):
                await db.execute(
                    """INSERT INTO burnin_jobs (drive_id, profile, state, percent, operator, created_at)
                       VALUES (?,?,?,?,?,?)""",
                    (1, "surface", state, 100, "op", "2026-05-02T00:00:00+00:00"),
                )
            await db.commit()
            # Should succeed — no other queued/running row exists.
            await db.execute(
                """INSERT INTO burnin_jobs (drive_id, profile, state, percent, operator, created_at)
                   VALUES (?,?,?,?,?,?)""",
                (1, "surface", "queued", 0, "op", "2026-05-02T00:00:00+00:00"),
            )
            await db.commit()


if __name__ == "__main__":
    unittest.main()