Substantial feature + reliability sweep. Each version below was developed,
tested live against the maple/TrueNAS deployment, and Codex-reviewed
before bundling.
1.0.0-13 — asyncssh proc.kill() doesn't actually kill the remote process
(sshd ignores SSH signal-channel requests by default), so a cancel of a
long-running badblocks left the remote process running and proc.wait()
hanging — pinning the asyncio.Semaphore slot forever.
* Wrap long-lived commands in `sh -c 'echo PID:$$; exec <cmd>'` to
capture the remote PID; store in burnin._remote_pids[job_id].
* burnin._kill_remote_process(job_id) opens a fresh SSH session and
issues `kill -9 <pid>` — sshd honours that.
* Bound proc.wait() with asyncio.wait_for(timeout=15).
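* burnin._active_tasks tracks every _run_job task so cancel_job and
check_stuck_jobs can actually cancel the asyncio task (previously
they only updated the DB row). Holding a strong reference here also
fixes the documented asyncio.create_task GC gotcha: the event loop
keeps only weak references to tasks, so an unreferenced task can be
garbage-collected mid-run.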
* _run_job finalizer reads current state and skips the write if state
!= 'running' so cancelled/unknown aren't clobbered.
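A minimal sketch of the PID-capture wrapper and the bounded wait from the
1.0.0-13 notes. The helper names (wrap_with_pid, parse_pid_line,
bounded_wait) are hypothetical and not taken from burnin.py; only the
`sh -c 'echo PID:$$; exec <cmd>'` shape and the 15-second bound come from
the notes above.

```python
import asyncio
import shlex
from typing import Awaitable, Optional


def wrap_with_pid(cmd: str) -> str:
    """Wrap cmd so the remote shell prints its own PID, then execs cmd.

    Because of exec, the printed PID is also the PID of cmd itself.
    """
    return "sh -c " + shlex.quote(f"echo PID:$$; exec {cmd}")


def parse_pid_line(line: str) -> Optional[int]:
    """Return the PID from a 'PID:<n>' line, or None for any other line."""
    prefix = "PID:"
    text = line.strip()
    if text.startswith(prefix) and text[len(prefix):].isdigit():
        return int(text[len(prefix):])
    return None


async def bounded_wait(waiter: Awaitable, timeout: float = 15.0) -> bool:
    """Await waiter but give up after timeout seconds.

    Returns True if it finished in time, False on timeout.
    """
    try:
        await asyncio.wait_for(waiter, timeout=timeout)
        return True
    except asyncio.TimeoutError:
        return False
```

The first stdout line of the wrapped command would be fed to
parse_pid_line; every later line is ordinary command output.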
1.0.0-14 — poller._upsert_drive ON CONFLICT only refreshed temperature/
health/poll timestamps; devname/serial/model/size_bytes were stuck at
first-INSERT values forever. After kernel SCSI re-enumeration two
drives could both show as `sda`. Fixed by updating all six fields.
Also added 7-day stale filter to _DRIVES_QUERY so removed drives drop
off the dashboard while audit/burnin_jobs FKs stay intact.
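The upsert fix can be sketched with the stdlib sqlite3 module. The table
below is a trimmed-down, hypothetical stand-in for the real drives schema,
and upsert_drive here is illustrative rather than the poller's actual
function; the point is only that DO UPDATE refreshes the identity columns
too.

```python
import sqlite3

# Hypothetical, reduced drives table; the real schema has more columns.
SCHEMA = """
CREATE TABLE drives (
    id INTEGER PRIMARY KEY,
    truenas_disk_id TEXT UNIQUE NOT NULL,
    devname TEXT, serial TEXT, model TEXT,
    size_bytes INTEGER, temperature_c INTEGER
)
"""

UPSERT = """
INSERT INTO drives
    (truenas_disk_id, devname, serial, model, size_bytes, temperature_c)
VALUES
    (:truenas_disk_id, :devname, :serial, :model, :size_bytes, :temperature_c)
ON CONFLICT(truenas_disk_id) DO UPDATE SET
    devname = excluded.devname,
    serial = excluded.serial,
    model = excluded.model,
    size_bytes = excluded.size_bytes,
    temperature_c = excluded.temperature_c
"""


def upsert_drive(db: sqlite3.Connection, **fields) -> None:
    """Insert the drive, or refresh every mutable column if it exists."""
    db.execute(UPSERT, fields)
    db.commit()


db = sqlite3.connect(":memory:")
db.execute(SCHEMA)
base = dict(truenas_disk_id="disk-1", serial="S1", model="M",
            size_bytes=1000, temperature_c=30)
upsert_drive(db, devname="sda", **base)
# After SCSI re-enumeration the same disk reappears under a new name.
upsert_drive(db, devname="sdc", **base)
rows = db.execute("SELECT devname FROM drives").fetchall()
```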
1.0.0-15/-16 — pool-membership lock.
* ssh_client.get_pool_membership() runs `zpool list -vHP` and parses
the flattened TrueNAS output (container vdevs + their device children
both appear at depth 1; section markers cache/log/spare/special/dedup
switch the role).
* ssh_client.get_zfs_member_drives() runs `lsblk -no NAME,FSTYPE -l`
to detect drives carrying ZFS labels not in any active pool — they
get pool_name='(exported)', pool_role='exported'.
* Three idempotent ALTER TABLE migrations on drives:
pool_name/pool_role/pool_seen_at.
* burnin.start_job raises PoolMemberError if pool_name IS NOT NULL and
the drive isn't in burnin._unlock_grants. Routes layer maps to 409
with structured detail {pool_name, pool_role, pool_locked: true} so
the frontend can render an unlock affordance.
* POST /api/v1/drives/{id}/unlock accepts {confirm_token, operator,
reason}. Token is the pool name for active pools, "DESTROY BOOT POOL"
for boot-pool, "DESTROY EXPORTED POOL" for exported. Reason >= 5
chars. TTL = UNLOCK_TTL_SECONDS = 600. Audit event types:
pool_drive_unlocked / boot_pool_drive_unlocked /
exported_pool_drive_unlocked.
* Grants are in-memory only — container restart wipes them.
* UI: lock icon (yellow/red/orange), pool pill, conditional Unlock vs
Burn-In button. modal_unlock.html with type-to-confirm field.
Live unlock countdown via tickUnlockCountdowns() in app.js.
* Daily report: red banner listing every unlock event from the last
24h, with operator + reason + timestamp.
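The confirm-token rule above can be sketched as follows. This is an
illustration under assumptions, not the code in burnin.py: expected_token
and validate_unlock are hypothetical names, and treating pool_name ==
'boot-pool' and pool_role == 'exported' as the discriminators is inferred
from the token list and seeded test rows.

```python
from typing import Optional

MIN_REASON_CHARS = 5  # "Reason >= 5 chars" from the notes above


def expected_token(pool_name: str, pool_role: str) -> str:
    """Return the phrase the operator must type for this pool kind."""
    if pool_role == "exported":
        return "DESTROY EXPORTED POOL"
    if pool_name == "boot-pool":
        return "DESTROY BOOT POOL"
    return pool_name  # active data pools: type the pool name itself


def validate_unlock(pool_name: Optional[str], pool_role: Optional[str],
                    confirm_token: str, operator: str, reason: str) -> None:
    """Raise ValueError unless the unlock request is acceptable."""
    if pool_name is None or pool_role is None:
        raise ValueError("drive is not a pool member")
    if not operator.strip():
        raise ValueError("operator is required")
    if len(reason.strip()) < MIN_REASON_CHARS:
        raise ValueError("reason is too short")
    if confirm_token != expected_token(pool_name, pool_role):
        raise ValueError("confirm token does not match")
```

Requiring a destructive phrase for boot and exported pools means typing
the pool name alone is never enough for those kinds.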
1.0.0-17 — Codex review fail-open + XSS + structured-error fixes.
* ssh_client.get_pool_membership / get_zfs_member_drives now return
None on failure (vs {} for 'definitely empty'). poller passes
update_pool=False to _upsert_drive on detection failure, preserving
existing pool columns instead of clearing them. Without this fix a
1-second SSH blip silently unlocked every drive.
* mailer._build_unlock_banner_html escapes every interpolated field
via html.escape() (was '<' only). Time filter switched to
julianday() — string >= against datetime('now', '-1 day') compared
formats with different separators ('T' vs ' ') and timezone
suffixes, causing subtle off-by-N-hour inclusion.
* app.js submitStart/submitBatchStart now detect the structured
pool_locked 409 detail and auto-open the unlock modal for the
offending drive (was [object Object] in toast).
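The time-filter bug is easy to reproduce in isolation. The sketch below
uses fixed timestamps instead of datetime('now', '-1 day') so the result
is deterministic; the two values are made-up examples of the stored
format ('T' separator plus offset) and the format SQLite's datetime()
produces (space separator).

```python
import sqlite3

db = sqlite3.connect(":memory:")

stored = "2026-05-01T12:00:00+00:00"  # ISO 8601 with 'T' and a UTC offset
cutoff = "2026-05-01 18:00:00"        # the shape datetime(...) returns

# Text comparison diverges at the 11th character: 'T' sorts after ' ',
# so the 12:00 row looks newer than the 18:00 cutoff. julianday() parses
# both strings into numbers and compares the actual instants.
text_cmp, julian_cmp = db.execute(
    "SELECT ? >= ?, julianday(?) >= julianday(?)",
    (stored, cutoff, stored, cutoff),
).fetchone()
```

Here text_cmp is 1 (row wrongly included) while julian_cmp is 0.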
1.0.0-18 — Codex grant-binding + commit-ordering fixes.
* Unlock grants bound to the (pool_name, pool_role) observed at unlock
time. _UnlockGrant dataclass; _is_unlocked and unlock_expiry
invalidate the grant if the live row's pool identity has changed.
Prevents an 'exported' unlock from carrying over when the drive
turns out to be in active 'tank' or 'boot-pool'.
* grant_pool_unlock now writes to _unlock_grants only AFTER db.commit()
succeeds — previously a failed audit insert left an unaudited grant
armed.
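A sketch of identity-bound grants and the commit-then-arm ordering. The
names mirror the notes loosely but the bodies are assumptions, not the
code in burnin.py; the commit callable stands in for the audit insert
plus db.commit().

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class UnlockGrant:
    expires_at: float
    pool_name: str
    pool_role: str


grants: Dict[int, UnlockGrant] = {}


def is_unlocked(drive_id: int, pool_name: str, pool_role: str,
                now: Optional[float] = None) -> bool:
    """True only if a live grant matches the pool identity seen right now.

    A mismatched or expired grant is removed, not just ignored.
    """
    grant = grants.get(drive_id)
    if grant is None:
        return False
    now = time.time() if now is None else now
    same_pool = (grant.pool_name, grant.pool_role) == (pool_name, pool_role)
    if not same_pool or now >= grant.expires_at:
        grants.pop(drive_id, None)
        return False
    return True


def arm_after_commit(drive_id: int, grant: UnlockGrant,
                     commit: Callable[[], None]) -> None:
    """Run commit first; arm the in-memory grant only if it succeeds."""
    commit()  # may raise, in which case nothing is armed
    grants[drive_id] = grant
```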
1.0.0-19 — Codex race + cancellation classification + test scaffold.
* Partial unique index uniq_active_burnin_per_drive ON burnin_jobs
(drive_id) WHERE state IN ('queued','running'). INSERT now wraps in
try/except aiosqlite.IntegrityError -> ValueError so the read-then-
insert race in start_job can't produce two queued rows for the same
drive.
* _run_job tracks was_cancelled flag; on bare task.cancel() (shutdown,
future code paths) where DB state is still 'running', finalizer
writes 'unknown' instead of mis-classifying as 'failed'.
* tests/ stdlib unittest scaffold:
- test_pool_parser.py (21 tests): mirror/raidz/draid container vdevs,
single-disk depth-1, plural section markers, partition stripping,
sdaa-style names, multi-pool, role reset between pools.
- test_unlock_flow.py (18 tests): token validation per pool kind,
identity-binding invalidation, TTL expiry, audit-commit-then-arm
ordering, unique-active-burnin partial index.
Run via `python -m unittest discover tests/`. No new dependencies.
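The partial unique index can be exercised on its own with the stdlib
sqlite3 module. The burnin_jobs table here is a reduced, hypothetical
version of the real one, and try_insert is an illustrative helper; the
index definition follows the one quoted above.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Reduced stand-in for the real burnin_jobs table.
db.execute(
    "CREATE TABLE burnin_jobs ("
    "id INTEGER PRIMARY KEY, drive_id INTEGER NOT NULL, state TEXT NOT NULL)"
)
db.execute(
    "CREATE UNIQUE INDEX uniq_active_burnin_per_drive "
    "ON burnin_jobs (drive_id) WHERE state IN ('queued','running')"
)


def try_insert(conn: sqlite3.Connection, drive_id: int, state: str) -> bool:
    """Insert a job row; return False if the unique index rejects it."""
    try:
        conn.execute(
            "INSERT INTO burnin_jobs (drive_id, state) VALUES (?, ?)",
            (drive_id, state),
        )
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        conn.rollback()
        return False
```

Only rows in the listed states are indexed, so any number of terminal
rows per drive can coexist with at most one active row.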
1.0.0-20 — Spearfoot-inspired badblocks tunables.
* surface_validate_block_size (-b, default 4096), surface_validate_
block_buffer (-c, default 64), surface_validate_passes (-p, default
1) exposed in Settings UI; persist via settings_store.json.
Validation: block size must be a power of 2 between 512 and
1048576. Defaults preserve existing behaviour. Bumping to 8192/64/1
roughly halves runtime on multi-TB HDDs at ~2x RAM cost.
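The block-size rule is a bounded power-of-two check. A minimal sketch
follows; is_valid_block_size and badblocks_tunable_args are hypothetical
names, and only the limits, defaults and the -b/-c/-p flags come from the
notes above.

```python
from typing import List

MIN_BLOCK_SIZE = 512
MAX_BLOCK_SIZE = 1048576


def is_valid_block_size(n: int) -> bool:
    """True if n is a power of two within the accepted range."""
    in_range = MIN_BLOCK_SIZE <= n <= MAX_BLOCK_SIZE
    # A power of two has exactly one bit set, so n & (n - 1) clears it.
    return in_range and (n & (n - 1)) == 0


def badblocks_tunable_args(block_size: int = 4096, block_buffer: int = 64,
                           passes: int = 1) -> List[str]:
    """Build the -b/-c/-p portion of a badblocks command line."""
    if not is_valid_block_size(block_size):
        raise ValueError(f"invalid block size: {block_size}")
    return ["-b", str(block_size), "-c", str(block_buffer), "-p", str(passes)]
```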
1.0.0-21 — SMART overall-health column actually populated.
* /api/v2.0/disk doesn't expose smart_health, so every drive defaulted
to UNKNOWN forever (only burn-in stages ever wrote a real value).
* ssh_client.get_smart_health_map([devnames]) runs `smartctl -H` for
all drives in a single SSH session, deterministically delimited with
@@devname@@ ... @@END@@ markers. Returns {devname: PASSED|FAILED|
UNKNOWN} or None on SSH failure.
* poller calls it every 5th cycle (~1 min at default 12s interval),
caches in _state['smart_health_cache'] so transient failures preserve
the previous values.
* Dashboard CSS: col-smart min-width 150 -> 95, horizontal padding 14
-> 6 so Short/Long SMART columns fit comfortably on a 13-inch
display.
* 5 additional parser tests (44 total, all passing).
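A sketch of parsing the delimited `smartctl -H` output. The exact layout
(each marker on its own line) and the function name parse_smart_health
are assumptions, not the real ssh_client parser; the matched result lines
are the ones smartctl prints for ATA ("test result: PASSED") and SCSI
("Health Status: OK") devices.

```python
import re
from typing import Dict

# One block per drive: @@<devname>@@, smartctl output, then @@END@@.
_BLOCK_RE = re.compile(r"@@([^@\n]+)@@\n(.*?)@@END@@", re.DOTALL)
_RESULT_RE = re.compile(r"(?:test result|Health Status):\s*(\S+)")


def parse_smart_health(raw: str) -> Dict[str, str]:
    """Map each devname to PASSED, FAILED or UNKNOWN."""
    health: Dict[str, str] = {}
    for devname, body in _BLOCK_RE.findall(raw):
        match = _RESULT_RE.search(body)
        verdict = match.group(1) if match else ""
        if verdict in ("PASSED", "OK"):
            health[devname] = "PASSED"
        elif verdict.startswith("FAILED"):
            health[devname] = "FAILED"
        else:
            # Missing or unrecognised result line (e.g. no SMART support).
            health[devname] = "UNKNOWN"
    return health
```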
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
303 lines
12 KiB
Python
"""Unit tests for the pool-drive unlock state machine in burnin.py.

Covers: token validation per pool kind, identity-binding (grant
invalidated when pool_name/pool_role changes), TTL expiry, the
audit-commit-then-arm ordering (a failing audit insert leaves no
in-memory grant), and the unique-active-burnin partial index that
prevents duplicate queued rows for the same drive.

Uses a temporary on-disk SQLite DB and monkeypatches
app.config.settings.db_path. No SSH, no network, no FastAPI.

Run with: python -m unittest discover tests/ -v
"""

import os
import tempfile
import time
import unittest

import aiosqlite


async def _setup_temp_db() -> str:
    """Create a temp SQLite file, point app.config at it, init schema.
    Async-callable from IsolatedAsyncioTestCase.asyncSetUp."""
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    from app.config import settings
    settings.db_path = path

    from app.database import init_db
    await init_db()
    # Seed pool drives so unlock_flow tests have something to grant on.
    async with aiosqlite.connect(path) as db:
        await db.execute("""
            INSERT INTO drives
                (truenas_disk_id, devname, serial, model, size_bytes,
                 temperature_c, smart_health, last_seen_at, last_polled_at,
                 pool_name, pool_role, pool_seen_at)
            VALUES ('test-id-1', 'sda', 'TESTSER1', 'TestModel', 1000,
                    30, 'PASSED', '2026-05-02T00:00:00+00:00',
                    '2026-05-02T00:00:00+00:00',
                    'tank', 'data', '2026-05-02T00:00:00+00:00')
        """)
        await db.execute("""
            INSERT INTO drives
                (truenas_disk_id, devname, serial, model, size_bytes,
                 temperature_c, smart_health, last_seen_at, last_polled_at,
                 pool_name, pool_role, pool_seen_at)
            VALUES ('test-id-2', 'sdb', 'TESTSER2', 'TestModel', 1000,
                    30, 'PASSED', '2026-05-02T00:00:00+00:00',
                    '2026-05-02T00:00:00+00:00',
                    'boot-pool', 'data', '2026-05-02T00:00:00+00:00')
        """)
        await db.execute("""
            INSERT INTO drives
                (truenas_disk_id, devname, serial, model, size_bytes,
                 temperature_c, smart_health, last_seen_at, last_polled_at,
                 pool_name, pool_role, pool_seen_at)
            VALUES ('test-id-3', 'sdc', 'TESTSER3', 'TestModel', 1000,
                    30, 'PASSED', '2026-05-02T00:00:00+00:00',
                    '2026-05-02T00:00:00+00:00',
                    '(exported)', 'exported', '2026-05-02T00:00:00+00:00')
        """)
        await db.commit()
    return path


class TestUnlockFlow(unittest.IsolatedAsyncioTestCase):

    async def asyncSetUp(self):
        self.db_path = await _setup_temp_db()
        # Reset module state so previous test runs don't bleed in.
        from app import burnin
        burnin._unlock_grants.clear()

    async def asyncTearDown(self):
        try:
            os.unlink(self.db_path)
        except OSError:
            pass

    # ----- token validation per pool kind -----

    async def test_active_pool_token_is_pool_name(self):
        from app import burnin
        # Drive 1 = tank/data
        with self.assertRaises(ValueError):
            await burnin.grant_pool_unlock(1, "wrong", "op", "valid reason")
        expiry = await burnin.grant_pool_unlock(1, "tank", "op", "valid reason")
        self.assertGreater(expiry, time.time())

    async def test_boot_pool_token_is_destroy_phrase(self):
        from app import burnin
        # Drive 2 = boot-pool: typing the pool name must NOT work.
        with self.assertRaises(ValueError):
            await burnin.grant_pool_unlock(2, "boot-pool", "op", "valid reason")
        expiry = await burnin.grant_pool_unlock(
            2, "DESTROY BOOT POOL", "op", "valid reason"
        )
        self.assertGreater(expiry, time.time())

    async def test_exported_token_is_destroy_phrase(self):
        from app import burnin
        # Drive 3 = (exported)/exported
        with self.assertRaises(ValueError):
            await burnin.grant_pool_unlock(3, "(exported)", "op", "valid reason")
        expiry = await burnin.grant_pool_unlock(
            3, "DESTROY EXPORTED POOL", "op", "valid reason"
        )
        self.assertGreater(expiry, time.time())

    # ----- input validation -----

    async def test_empty_reason_rejected(self):
        from app import burnin
        with self.assertRaises(ValueError):
            await burnin.grant_pool_unlock(1, "tank", "op", "")

    async def test_short_reason_rejected(self):
        from app import burnin
        with self.assertRaises(ValueError):
            await burnin.grant_pool_unlock(1, "tank", "op", "hi")

    async def test_empty_operator_rejected(self):
        from app import burnin
        with self.assertRaises(ValueError):
            await burnin.grant_pool_unlock(1, "tank", "", "valid reason")

    async def test_unknown_drive_rejected(self):
        from app import burnin
        with self.assertRaises(ValueError):
            await burnin.grant_pool_unlock(99999, "anything", "op", "valid reason")

    async def test_drive_not_in_pool_rejected(self):
        from app import burnin
        # Manually clear pool fields on drive 1
        async with aiosqlite.connect(self.db_path) as db:
            await db.execute("UPDATE drives SET pool_name=NULL, pool_role=NULL WHERE id=1")
            await db.commit()
        with self.assertRaises(ValueError):
            await burnin.grant_pool_unlock(1, "tank", "op", "valid reason")

    # ----- identity binding (Codex finding #2) -----

    async def test_grant_invalidated_when_pool_name_changes(self):
        from app import burnin
        await burnin.grant_pool_unlock(1, "tank", "op", "valid reason")
        # Operator's grant references tank/data; pool detection now reports tank2.
        self.assertTrue(burnin._is_unlocked(1, "tank", "data"))
        self.assertFalse(burnin._is_unlocked(1, "tank2", "data"))
        # And the side effect: the grant is reaped, not just temporarily denied.
        self.assertNotIn(1, burnin._unlock_grants)

    async def test_grant_invalidated_when_pool_role_changes(self):
        from app import burnin
        await burnin.grant_pool_unlock(1, "tank", "op", "valid reason")
        # Same pool, different role (data -> cache).
        self.assertFalse(burnin._is_unlocked(1, "tank", "cache"))
        self.assertNotIn(1, burnin._unlock_grants)

    async def test_unlock_expiry_returns_none_for_mismatched_identity(self):
        from app import burnin
        await burnin.grant_pool_unlock(1, "tank", "op", "valid reason")
        self.assertIsNotNone(burnin.unlock_expiry(1, "tank", "data"))
        self.assertIsNone(burnin.unlock_expiry(1, "tank2", "data"))

    # ----- TTL expiry -----

    async def test_expired_grant_returns_false(self):
        from app import burnin
        # Drop TTL to 0 so the grant is born expired.
        original = burnin.UNLOCK_TTL_SECONDS
        burnin.UNLOCK_TTL_SECONDS = 0
        try:
            await burnin.grant_pool_unlock(1, "tank", "op", "valid reason")
            self.assertFalse(burnin._is_unlocked(1, "tank", "data"))
            self.assertNotIn(1, burnin._unlock_grants)
        finally:
            burnin.UNLOCK_TTL_SECONDS = original

    # ----- audit commit ordering (Codex finding #3) -----

    async def test_audit_event_recorded_for_active_pool(self):
        from app import burnin
        await burnin.grant_pool_unlock(1, "tank", "alice", "swapping out drive")
        async with aiosqlite.connect(self.db_path) as db:
            db.row_factory = aiosqlite.Row
            cur = await db.execute(
                "SELECT event_type, operator, message FROM audit_events "
                "WHERE drive_id=? ORDER BY id DESC LIMIT 1", (1,)
            )
            row = await cur.fetchone()
        self.assertEqual(row["event_type"], "pool_drive_unlocked")
        self.assertEqual(row["operator"], "alice")
        self.assertIn("swapping out drive", row["message"])

    async def test_audit_event_for_boot_pool_uses_distinct_type(self):
        from app import burnin
        await burnin.grant_pool_unlock(
            2, "DESTROY BOOT POOL", "alice", "replacing failed mirror"
        )
        async with aiosqlite.connect(self.db_path) as db:
            db.row_factory = aiosqlite.Row
            cur = await db.execute(
                "SELECT event_type FROM audit_events WHERE drive_id=? ORDER BY id DESC LIMIT 1",
                (2,),
            )
            row = await cur.fetchone()
        self.assertEqual(row["event_type"], "boot_pool_drive_unlocked")

    async def test_audit_event_for_exported_uses_distinct_type(self):
        from app import burnin
        await burnin.grant_pool_unlock(
            3, "DESTROY EXPORTED POOL", "alice", "decommissioned pool"
        )
        async with aiosqlite.connect(self.db_path) as db:
            db.row_factory = aiosqlite.Row
            cur = await db.execute(
                "SELECT event_type FROM audit_events WHERE drive_id=? ORDER BY id DESC LIMIT 1",
                (3,),
            )
            row = await cur.fetchone()
        self.assertEqual(row["event_type"], "exported_pool_drive_unlocked")

    async def test_failed_token_does_not_record_audit_event(self):
        from app import burnin
        try:
            await burnin.grant_pool_unlock(1, "wrong-token", "op", "valid reason")
        except ValueError:
            pass
        async with aiosqlite.connect(self.db_path) as db:
            cur = await db.execute(
                "SELECT COUNT(*) FROM audit_events WHERE drive_id=?", (1,)
            )
            self.assertEqual((await cur.fetchone())[0], 0)
        # And no in-memory grant was armed.
        self.assertNotIn(1, burnin._unlock_grants)


class TestActiveJobUniqueIndex(unittest.IsolatedAsyncioTestCase):
    """Codex finding #4: the partial unique index on burnin_jobs(drive_id)
    WHERE state IN ('queued','running') must reject a second active row even
    when two requests pass the SELECT-COUNT check concurrently."""

    async def asyncSetUp(self):
        self.db_path = await _setup_temp_db()
        from app import burnin
        burnin._unlock_grants.clear()
        # Need to clear the pool field on drive 1 so unlock isn't required
        # for these race tests.
        async with aiosqlite.connect(self.db_path) as db:
            await db.execute("UPDATE drives SET pool_name=NULL, pool_role=NULL WHERE id=1")
            await db.commit()
        # Burnin orchestrator init for the semaphore
        from app import burnin as b
        import asyncio as _a
        b._semaphore = _a.Semaphore(4)

    async def asyncTearDown(self):
        try:
            os.unlink(self.db_path)
        except OSError:
            pass

    async def test_index_blocks_second_active_insert(self):
        # Insert a queued row by hand, then try a second one; the index fires.
        async with aiosqlite.connect(self.db_path) as db:
            await db.execute(
                """INSERT INTO burnin_jobs (drive_id, profile, state, percent, operator, created_at)
                   VALUES (?,?,?,?,?,?)""",
                (1, "surface", "queued", 0, "op", "2026-05-02T00:00:00+00:00"),
            )
            await db.commit()
            with self.assertRaises(aiosqlite.IntegrityError):
                await db.execute(
                    """INSERT INTO burnin_jobs (drive_id, profile, state, percent, operator, created_at)
                       VALUES (?,?,?,?,?,?)""",
                    (1, "surface", "queued", 0, "op", "2026-05-02T00:00:01+00:00"),
                )
                await db.commit()

    async def test_index_allows_terminal_state_then_new_job(self):
        # passed/failed/cancelled/unknown rows must not block a fresh queue.
        async with aiosqlite.connect(self.db_path) as db:
            for state in ("passed", "failed", "cancelled", "unknown"):
                await db.execute(
                    """INSERT INTO burnin_jobs (drive_id, profile, state, percent, operator, created_at)
                       VALUES (?,?,?,?,?,?)""",
                    (1, "surface", state, 100, "op", "2026-05-02T00:00:00+00:00"),
                )
            await db.commit()
            # Should succeed: no other queued/running row exists.
            await db.execute(
                """INSERT INTO burnin_jobs (drive_id, profile, state, percent, operator, created_at)
                   VALUES (?,?,?,?,?,?)""",
                (1, "surface", "queued", 0, "op", "2026-05-02T00:00:00+00:00"),
            )
            await db.commit()


if __name__ == "__main__":
    unittest.main()