Compare commits

No commits in common. "main" and "v1.0.0-41" have entirely different histories.

18 changed files with 109 additions and 1388 deletions

101
README.md

@@ -83,12 +83,11 @@ runtime roughly in half at ~2× RAM cost — matches the upstream
 ### Watch out
-- **Stuck-job timeout** `stuck_job_hours` (default 168 = 7 days)
-marks any job past that threshold as `unknown` and kills the remote
-process. The default covers `-w` surface_validate on 14 TB+ HDDs with
-margin. If you're running short SSDs and want faster detection of
-genuinely stuck jobs, drop it. (Earlier versions defaulted to 24h
-which false-positived on multi-TB drives.)
+- **Stuck-job timeout** `stuck_job_hours` (default 24) marks any job
+past that threshold as `unknown` and kills the remote process. If
+you're burning in 14 TB drives with default block size, raise this to
+**48** in Settings before starting, or you'll get false positives near
+the end of surface_validate.
 - **Thermal gate** — if drives currently under burn-in hit the
 temperature warning threshold, new jobs wait up to 3 minutes before
 acquiring a slot. Increase `temp_warn_c` if your chassis runs hot but
@@ -106,91 +105,6 @@ Click the red ✕ next to a running job. The orchestrator:
 Cancellations are durable — restart the container and queued jobs resume,
 cancelled jobs stay cancelled.
-### Job states explained
-| State | When it's set |
-|-------------|-------------------------------------------------------------------------------|
-| `queued` | Submitted, waiting for a `max_parallel_burnins` slot |
-| `running` | Actively executing some stage |
-| `passed` | All stages finished green |
-| `failed` | A stage failed deterministically (bad blocks > threshold, SMART failure, etc.) |
-| `cancelled` | Operator clicked ✕ |
-| `unknown` | Job was alive but its outcome is indeterminate — see below |
-`unknown` fires in two situations:
-1. The stuck-job detector (`stuck_job_hours`, default 7 days) trips because
-the job has been running too long without finishing.
-2. The asyncio task got cancelled mid-stage by something *other* than an
-operator click — usually a container restart (`docker compose up -d`,
-`--build`, or the host rebooting). Burn-in source code goes through
-the Dockerfile `COPY`, so any source-code deploy recreates the
-container, drops the SSH connection to TrueNAS, and would orphan the
-running burn-in. Avoid `--build` while burn-ins are active.
-When `unknown` fires the drawer's per-stage Reason block shows
-*"Task cancelled mid-run — likely container restart or shutdown"* so the
-classification is explicit, not silent.
----
-## Drive drawer
-Click any drive row to slide a detail drawer down from the top. Three tabs:
-- **Burn-In** — per-stage breakdown of the latest job
-- **SMART** — short/long test states + cached SMART attributes
-- **Events** — last 50 audit events for the drive
-### Surface-validate visualization
-For drives in a `surface_validate` stage (running or finished), the Burn-In
-tab renders:
-1. **Vital-signs strip** `Start` (with date) · `Elapsed` · `ETA` (duration
-remaining) · `Finish` (wall-clock estimate, browser-local timezone) ·
-`Temp` (cool/warm/hot colour). Computed from data in the drawer payload;
-ETA + Finish suppressed below 0.5% so you don't see a "Finish: Jun 22"
-stutter at the very start.
-2. **Four pattern meters** `0xaa` / `0x55` / `0xff` / `0x00`. Each meter
-is split into a left half (write phase, blue) and a right half (verify
-phase, green). Current pattern's label glows blue; completed patterns'
-labels go green. This translates badblocks's per-phase percent into
-monotonic 0-99% overall progress, so the bar never appears to "rewind"
-when a new phase starts.
-3. **Phase caption** — explicit text: *"Pattern 2 of 4 · Verify 0x55 · 47%
-within phase"*. Makes the visual grammar unambiguous.
-4. **Completed-pattern history** — once pattern 1 finishes, a chip appears
-showing `0xaa: 14h 22m`. Lets you predict the rest of the run from the
-first pattern's elapsed time.
-### Failure reason block
-Stages that ended `failed` / `cancelled` / `unknown` show a coloured Reason
-pill at the top of the stage section. Sources, in order of preference:
-1. The stage's own `error_text`
-2. The parent job's `error_text` (backfilled by the drawer when the stage's
-own is empty — catches orphan rows from hard crashes)
-3. A heuristic: if the log is tiny and no real progress was recorded,
-*"Stopped without recording an error — likely cause: SSH connection drop
-or container restart while this stage was running"*
-Otherwise: *"No error message recorded."* — there's never a blank where you
-expect to see why something broke.
-### Column sorting
-Click any column header (Drive, Serial, Size, Temp, Health, Short SMART,
-Long SMART, Burn-In) to sort. Cycle: ascending → descending → cleared. Sort
-state persists in `localStorage` so it survives page reload AND every
-SSE-driven tbody refresh (~12 s poll cycle). Empty values always sink to
-the bottom regardless of direction.
-Sortable values are emitted as `data-sort-*` attributes on each `<tr>`,
-with numeric priority maps for SMART states (e.g. `running` always sorts
-ahead of `idle`).
 ---
 ## Drive locks
@@ -230,8 +144,7 @@ All settings live under `/settings` (header link). Key knobs:
 - **`surface_validate_block_size` / `_block_buffer` / `_passes`** —
 badblocks `-b` / `-c` / `-p`. Defaults preserve original behaviour;
 tune for speed vs paranoia.
-- **`stuck_job_hours`** (default 168 = 7 days) — covers 14 TB+ HDDs;
-drop for faster detection on small fast drives.
+- **`stuck_job_hours`** (default 24) — raise for big drives.
 - **`temp_warn_c` / `temp_crit_c`** — thermal gating thresholds.
 - **`bad_block_threshold`** (default 0) — number of bad blocks
 surface_validate tolerates before failing the stage.
@@ -346,7 +259,7 @@ pinned version after the fact.
 - `CLAUDE.md` — full architecture, file map, deploy workflow, and the
 rationale behind every non-obvious design decision.
 - `SPEC.md` — canonical feature reference per version.
-- `tests/` `python -m unittest discover tests/` (65 tests, stdlib-only). Or run inside the deployed container with `scripts/run-tests.sh`.
+- `tests/` `python -m unittest discover tests/` (44 tests, stdlib-only).
 ---
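
The removed "Column sorting" README section says empty values sink to the bottom regardless of sort direction. The sketch below shows one way to express that rule in Python. `sort_rows` and the dict-shaped rows are hypothetical stand-ins; the real UI sorts table rows in the browser using the `data-sort-*` attributes, and its code is not part of this diff.

```python
from typing import Any


def sort_rows(rows: list[dict[str, Any]], key: str, descending: bool = False) -> list[dict[str, Any]]:
    """Sort rows by `key`; rows whose value is None or "" always go last."""

    def is_empty(row: dict[str, Any]) -> bool:
        value = row.get(key)
        return value is None or value == ""

    # Sort only the populated rows, then append the empty ones unchanged,
    # so reversing the direction never moves empties to the top.
    filled = sorted(
        (r for r in rows if not is_empty(r)),
        key=lambda r: r[key],
        reverse=descending,
    )
    empty = [r for r in rows if is_empty(r)]
    return filled + empty


rows = [{"temp": 41}, {"temp": None}, {"temp": 35}]
ascending = [r["temp"] for r in sort_rows(rows, "temp")]
descending = [r["temp"] for r in sort_rows(rows, "temp", descending=True)]
```

Splitting the rows before sorting, rather than encoding emptiness into the sort key, keeps the rule independent of direction.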


@@ -72,14 +72,6 @@ class User:
     is_admin: bool
-def LoopbackUser(username: str = "monitor", full_name: str = "Autonomous Monitor") -> User:
-    """Synthetic admin used by the loopback bypass in _AuthGateMiddleware.
-    id=0 (no real DB row) and is_admin=True so admin-gated routes work.
-    Only reachable when request.client.host is 127.0.0.1 / ::1, i.e.
-    a process inside the container's network namespace (docker exec)."""
-    return User(id=0, username=username, full_name=full_name, is_admin=True)
 def _now() -> str:
     return datetime.now(timezone.utc).isoformat()


@@ -93,7 +93,6 @@ async def init(client: TrueNASClient) -> None:
     async with _db() as db:
         db.row_factory = aiosqlite.Row
         await db.execute("PRAGMA journal_mode=WAL")
-        await db.execute("PRAGMA busy_timeout=60000")
         await db.execute("PRAGMA foreign_keys=ON")
         # Mark interrupted running jobs as unknown
@@ -162,7 +161,6 @@ async def start_job(drive_id: int, profile: str, operator: str,
     async with _db() as db:
         db.row_factory = aiosqlite.Row
         await db.execute("PRAGMA journal_mode=WAL")
-        await db.execute("PRAGMA busy_timeout=60000")
         await db.execute("PRAGMA foreign_keys=ON")
         # Reject duplicate active burn-in for same drive
@@ -263,7 +261,6 @@ async def cancel_job(job_id: int, operator: str) -> bool:
     async with _db() as db:
         db.row_factory = aiosqlite.Row
         await db.execute("PRAGMA journal_mode=WAL")
-        await db.execute("PRAGMA busy_timeout=60000")
         cur = await db.execute(
             "SELECT state, drive_id FROM burnin_jobs WHERE id=?", (job_id,)
@@ -348,7 +345,6 @@ async def _run_job(job_id: int) -> None:
     # Transition queued → running
     async with _db() as db:
         await db.execute("PRAGMA journal_mode=WAL")
-        await db.execute("PRAGMA busy_timeout=60000")
         row = await (await db.execute(
             "SELECT drive_id, profile FROM burnin_jobs WHERE id=?", (job_id,)
         )).fetchone()
@@ -415,33 +411,11 @@ async def _run_job(job_id: int) -> None:
         final_state = "unknown"
     else:
         final_state = "passed" if success else "failed"
-    # If the asyncio task was cancelled mid-stage (container shutdown,
-    # uvicorn reload, etc.), CancelledError propagates past
-    # _execute_stages, so any running stage row is still marked
-    # 'running' in the DB. Reconcile here: mark every still-running
-    # stage on this job as 'unknown' with the parent's finished_at,
-    # and stamp a default error_text so the drawer's Reason block has
-    # something concrete to show. Use a write that's idempotent under
-    # repeat (only touches rows still 'running').
-    cancel_err = (
-        "Task cancelled mid-run — likely container restart or shutdown"
-        if was_cancelled else None
-    )
     async with _db() as db:
         await db.execute("PRAGMA journal_mode=WAL")
-        await db.execute("PRAGMA busy_timeout=60000")
         await db.execute(
             "UPDATE burnin_jobs SET state=?, percent=?, finished_at=?, error_text=? WHERE id=?",
-            (final_state, 100 if success else None, _now(),
-             error_text or cancel_err, job_id),
-        )
-        if was_cancelled:
-            await db.execute(
-                """UPDATE burnin_stages
-                SET state='unknown', finished_at=?,
-                error_text=COALESCE(error_text, ?)
-                WHERE burnin_job_id=? AND state='running'""",
-                (_now(), cancel_err, job_id),
-            )
+            (final_state, 100 if success else None, _now(), error_text, job_id),
+        )
         await db.execute(
             """INSERT INTO audit_events (event_type, drive_id, burnin_job_id, operator, message)
@@ -568,7 +542,6 @@ async def check_stuck_jobs() -> None:
     async with _db() as db:
         db.row_factory = aiosqlite.Row
         await db.execute("PRAGMA journal_mode=WAL")
-        await db.execute("PRAGMA busy_timeout=60000")
         cur = await db.execute("""
             SELECT bj.id, bj.drive_id, d.devname, bj.started_at


@@ -77,13 +77,9 @@ def _now() -> str:
 @asynccontextmanager
 async def _db():
     """Open a WAL-mode connection with busy_timeout so writers wait for the lock
-    instead of immediately raising 'database is locked' under contention.
-    60s timeout is intentionally generous: with 4 concurrent burn-in drains
-    + the poller + retention + auth all writing, brief contention spikes
-    are normal and waiting is the right behavior. 10s was too tight."""
+    instead of immediately raising 'database is locked' under contention."""
     async with aiosqlite.connect(settings.db_path) as db:
-        await db.execute("PRAGMA busy_timeout=60000")
+        await db.execute("PRAGMA busy_timeout=10000")
         yield db
@@ -194,72 +190,6 @@ async def _update_stage_bad_blocks(job_id: int, stage_name: str, count: int) ->
         await db.commit()
-async def _update_stage_bb_phase(
-    job_id: int, stage_name: str, phase: int, phase_pct: float,
-) -> None:
-    """Persist per-pattern badblocks progress so the drive-drawer UI
-    can render 4 meters with separate write/verify halves."""
-    async with _db() as db:
-        await db.execute("PRAGMA journal_mode=WAL")
-        await db.execute(
-            "UPDATE burnin_stages SET bb_phase=?, bb_phase_pct=? "
-            "WHERE burnin_job_id=? AND stage_name=?",
-            (phase, phase_pct, job_id, stage_name),
-        )
-        await db.commit()
-async def _update_stage_bb_mbps(
-    job_id: int, stage_name: str, mbps: float,
-) -> None:
-    """Persist live throughput for the surface_validate meter strip.
-    Computed from delta_overall_pct between successive badblocks
-    progress lines, scaled by drive size_bytes / 800 (8 phases × 100)."""
-    async with _db() as db:
-        await db.execute("PRAGMA journal_mode=WAL")
-        await db.execute(
-            "UPDATE burnin_stages SET bb_mbps=? "
-            "WHERE burnin_job_id=? AND stage_name=?",
-            (mbps, job_id, stage_name),
-        )
-        await db.commit()
-async def _record_bb_phase_start(
-    job_id: int, stage_name: str, phase: int, ts: str,
-) -> None:
-    """Record the moment a phase first becomes current. Idempotent:
-    re-entry of the same phase keeps the original timestamp so a
-    transient parser reset doesn't blow away history.
-    Stored as a JSON object keyed by phase number (string). The
-    drawer reads it to compute per-pattern elapsed times.
-    """
-    async with _db() as db:
-        await db.execute("PRAGMA journal_mode=WAL")
-        cur = await db.execute(
-            "SELECT bb_phase_history FROM burnin_stages "
-            "WHERE burnin_job_id=? AND stage_name=?",
-            (job_id, stage_name),
-        )
-        row = await cur.fetchone()
-        existing = {}
-        if row and row[0]:
-            try:
-                existing = json.loads(row[0])
-            except (json.JSONDecodeError, TypeError):
-                existing = {}
-        key = str(phase)
-        if key not in existing:
-            existing[key] = ts
-        await db.execute(
-            "UPDATE burnin_stages SET bb_phase_history=? "
-            "WHERE burnin_job_id=? AND stage_name=?",
-            (json.dumps(existing), job_id, stage_name),
-        )
-        await db.commit()
 async def _store_smart_attrs(drive_id: int, attrs: dict) -> None:
     """Persist latest SMART attribute dict to drives.smart_attrs (JSON)."""
     # Convert int keys to str for JSON serialisation
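
The removed `_record_bb_phase_start` keeps the first timestamp seen for each phase, so re-entering a phase does not overwrite history. The same idea is shown below as a pure function with no database access. `record_phase_start` is a hypothetical name, and every timestamp after the first is made up for illustration.

```python
import json
from typing import Optional


def record_phase_start(history_json: Optional[str], phase: int, ts: str) -> str:
    """Return updated JSON history; a phase already recorded keeps its first timestamp."""
    try:
        history = json.loads(history_json) if history_json else {}
    except (json.JSONDecodeError, TypeError):
        # Corrupt or unexpected stored value: start from an empty history.
        history = {}
    # setdefault only writes when the key is absent, which gives idempotence.
    history.setdefault(str(phase), ts)
    return json.dumps(history)


h = record_phase_start(None, 1, "2026-05-09T05:39:44+00:00")
h = record_phase_start(h, 1, "2026-05-09T06:00:00+00:00")  # re-entry, ignored
h = record_phase_start(h, 2, "2026-05-09T12:50:00+00:00")
history = json.loads(h)
```

Keys are strings because JSON object keys always are, which matches the "keyed by phase number (string)" note in the removed docstring.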


@@ -25,110 +25,23 @@ class _BadblocksResult(TypedDict):
     aborted: bool
-# `badblocks -w` cycles through 4 patterns (0xaa, 0x55, 0xff, 0x00),
-# each with a write phase followed by a read-back/verify phase = 8 phases.
-# Per-phase percent comes back via `XX% done`; without translation, the
-# dashboard appears to "rewind" every ~2 hours when a new phase starts.
-_BB_PATTERN_PHASE = {"0xaa": 1, "0x55": 3, "0xff": 5, "0x00": 7}
-_BB_TOTAL_PHASES = 8
-# Throttle DB writes from the badblocks parser. Each progress line used
-# to trigger 4-6 transactions; with 4 concurrent burn-ins emitting sub-
-# second progress lines, the asyncssh drain couldn't keep up — the
-# stdout pipe on TrueNAS filled, badblocks blocked on pipe_write,
-# disk I/O effectively stopped. 5 seconds is fine for the UI (drawer
-# polls every ~12s anyway) and cuts DB load 60-80x.
-BB_DB_MIN_SECONDS = 5.0
-import re as _re_pre  # noqa: E402
-_BB_PATTERN_RE = _re_pre.compile(r"Testing with pattern\s+(0x[0-9a-fA-F]+)")
-_BB_VERIFY_RE = _re_pre.compile(r"Reading and comparing")
-_BB_PERCENT_RE = _re_pre.compile(r"([\d.]+)%\s+done")
-class _BadblocksProgress:
-    """Track which phase of `badblocks -w -p N` we're in so the
-    displayed percent maps to overall progress, not per-phase progress.
-    Pure state machine, no I/O. Feed it lines from the badblocks output
-    via :meth:`update`; read :attr:`overall_pct` after each call.
-    Behavior:
-    - Defaults to phase 1 (write 0xaa) before any header is seen.
-    - "Testing with pattern 0xXX" sets the phase to the write-phase index
-      for that pattern (1, 3, 5, or 7).
-    - "Reading and comparing" advances to the matching verify phase
-      (last_write_phase + 1).
-    - "XX% done" updates the in-phase percent.
-    - overall_pct = ((phase - 1) * 100 + phase_pct) / 8, clipped to 99
-      so we don't claim "100%" until the stage's success path explicitly
-      writes 100.
-    """
-    __slots__ = ("phase", "phase_pct", "_last_write_phase")
-    def __init__(self) -> None:
-        self.phase: int = 1
-        self.phase_pct: float = 0.0
-        self._last_write_phase: int = 1
-    def update(self, line: str) -> None:
-        m = _BB_PATTERN_RE.search(line)
-        if m:
-            p = m.group(1).lower()
-            if p in _BB_PATTERN_PHASE:
-                self.phase = _BB_PATTERN_PHASE[p]
-                self._last_write_phase = self.phase
-                self.phase_pct = 0.0
-            return
-        if _BB_VERIFY_RE.search(line):
-            self.phase = self._last_write_phase + 1
-            self.phase_pct = 0.0
-            return
-        m = _BB_PERCENT_RE.search(line)
-        if m:
-            try:
-                self.phase_pct = float(m.group(1))
-            except ValueError:
-                pass
-    @property
-    def overall_pct(self) -> int:
-        total = (self.phase - 1) * 100.0 + self.phase_pct
-        return min(99, int(total / _BB_TOTAL_PHASES))
 def _build_badblocks_cmd(devname: str) -> str:
     """Construct the wrapped badblocks command for a given device.
-    badblocks's progress output uses '\\b' backspace characters to
-    overwrite the previous "XX% done" line; there's no '\\n' between
-    updates until a phase transition. asyncssh's line-buffered reader
-    needs a real '\\n' to yield a line, so we pipe the output through
-    `tr '\\b' '\\n'` at the shell level. After this, every progress
-    update is a normal newline-terminated line.
-    Inner shell does `echo PID:$$; exec badblocks ...` so $$ is the
-    badblocks PID after exec (needed for out-of-band kill -9; asyncssh's
-    signal channel is ignored by sshd). 2>&1 merges stderr into stdout
-    so tr sees the progress lines (badblocks emits them on stderr).
-    Geometry (-b -c -p) is operator-tunable via Settings Burn-in;
-    defaults match the Spearfoot disk-burnin.sh recommendation.
+    Wraps badblocks under `sh -c 'echo PID:$$; exec ...'` so we can
+    capture the remote PID for out-of-band kill -9 (asyncssh's signal
+    channel is ignored by sshd). Geometry (-b -c -p) is operator-tunable
+    via Settings Burn-in; defaults match the Spearfoot disk-burnin.sh
+    recommendation for large HDDs.
     """
-    inner = (
-        f"echo PID:$$; exec badblocks "
+    return (
+        f"sh -c 'echo PID:$$; exec badblocks "
         f"-wsv "
         f"-b {settings.surface_validate_block_size} "
         f"-c {settings.surface_validate_block_buffer} "
         f"-p {settings.surface_validate_passes} "
-        f"/dev/{devname} 2>&1"
+        f"/dev/{devname}'"
     )
-    # The outer pipeline lets tr translate \\b → \\n. stdbuf -oL forces
-    # tr's stdout to line-buffered mode; without it tr's stdout is
-    # block-buffered (4 KB chunks) when its destination is a pipe,
-    # which delays each progress line by ~6 minutes at our throughput.
-    return f"sh -c '{inner}' | stdbuf -oL tr '\\b' '\\n'"
 from . import kill
 from ._common import (
@@ -136,16 +49,12 @@ from ._common import (
     _append_stage_log,
     _db,
     _is_cancelled,
-    _now,
     _push_update,
     _recalculate_progress,
-    _record_bb_phase_start,
     _set_stage_error,
     _store_smart_attrs,
     _store_smart_raw_output,
     _update_stage_bad_blocks,
-    _update_stage_bb_mbps,
-    _update_stage_bb_phase,
     _update_stage_percent,
 )
@@ -490,17 +399,6 @@ async def _stage_surface_validate_ssh(job_id: int, devname: str, drive_id: int)
     """Run badblocks over SSH, streaming output to stage log."""
     from app import ssh_client
-    # Pull drive size for the throughput calculation. Each badblocks
-    # phase covers the full disk once, so 1% overall progress = size/800
-    # bytes (8 phases × 100). NULL-safe: if size lookup fails we just
-    # skip the MB/s update.
-    drive_size_bytes: int | None = None
-    async with _db() as db:
-        cur = await db.execute("SELECT size_bytes FROM drives WHERE id=?", (drive_id,))
-        row = await cur.fetchone()
-        if row and row[0]:
-            drive_size_bytes = int(row[0])
     await _append_stage_log(
         job_id, "surface_validate",
         f"[START] badblocks -wsv -b {settings.surface_validate_block_size} "
@@ -527,47 +425,17 @@ async def _stage_surface_validate_ssh(job_id: int, devname: str, drive_id: int)
             #
             cmd = _build_badblocks_cmd(devname)
             async with conn.create_process(cmd) as proc:
+                import re as _re
                 pid_seen = False
-                progress = _BadblocksProgress()
-                # Throughput tracker — store (overall_pct, monotonic_ts)
-                # of the previous progress sample so we can compute MB/s
-                # from the delta on each new sample.
-                last_pct_sample: float = progress.overall_pct
-                last_db_write_ts: float = time.monotonic()
-                # Lines accumulated since last log flush. Flushed in the
-                # throttled DB-write window (see BB_DB_MIN_SECONDS).
-                pending_log_chunks: list[str] = []
-                # Seed bb_phase=1, bb_phase_pct=0 immediately so the
-                # drawer's per-pattern meters have something to render
-                # before badblocks emits its first "X% done" line. On a
-                # 14 TB drive that first line can be several minutes in,
-                # and a blank meter strip looks broken to the operator.
-                await _update_stage_bb_phase(
-                    job_id, "surface_validate",
-                    progress.phase, progress.phase_pct,
-                )
-                # Stamp phase 1 (write 0xaa) start so the drawer's
-                # duration history starts populating immediately.
-                await _record_bb_phase_start(
-                    job_id, "surface_validate", progress.phase, _now(),
-                )
-                _push_update()
                 async def _drain(stream, is_stderr: bool):
-                    nonlocal bad_blocks_total, pid_seen, last_db_write_ts, last_pct_sample
-                    # Line-based drain. The wrapped badblocks command
-                    # pipes through `tr '\b' '\n'` at the shell level
-                    # so every progress update is a real newline-
-                    # terminated line by the time it reaches us.
+                    nonlocal bad_blocks_total, pid_seen
                     async for raw in stream:
                         line = raw if isinstance(raw, str) else raw.decode("utf-8", errors="replace")
-                        if not line.strip():
-                            continue
-                        # First stdout line is "PID:<n>" from the
-                        # wrapping shell. Capture and skip.
+                        # First stdout line is "PID:<n>" from the wrapping shell.
+                        # Capture it and don't append it to the user-visible log.
                         if not is_stderr and not pid_seen and line.startswith("PID:"):
                             pid_seen = True
                             try:
@@ -580,86 +448,27 @@ async def _stage_surface_validate_ssh(job_id: int, devname: str, drive_id: int)
                                 pass
                             continue
-                        # Note: with the `tr` pipe, badblocks's stderr is
-                        # merged into stdout (`2>&1`). is_stderr is now
-                        # always False — we treat every non-PID line as
-                        # potentially containing progress or bad-block
-                        # output. The phase parser is idempotent on
-                        # unrelated lines.
-                        prev_phase = progress.phase
-                        progress.update(line)
-                        phase_changed = progress.phase != prev_phase
-                        is_progress_line = bool(_BB_PERCENT_RE.search(line))
-                        # Bare-number lines from badblocks are bad-block
-                        # block numbers (one per line on stdout).
-                        stripped = line.strip()
-                        if stripped and stripped.isdigit() and not is_progress_line:
-                            bad_blocks_total += 1
-                        # Keep "XX% done" lines OUT of output_lines. Big
-                        # volume + quadratic log_text concat.
-                        if not is_progress_line:
-                            output_lines.append(line)
-                        # Single throttle gate covering EVERY DB touch.
-                        # Cumulative DB load otherwise overwhelms the
-                        # asyncio loop → asyncssh drain falls behind →
-                        # SSH window stops advancing → pipe fills →
-                        # badblocks blocks on pipe_write → disk I/O stops.
-                        now_ts = time.monotonic()
-                        time_since_last_db = now_ts - last_db_write_ts
-                        should_write = phase_changed or time_since_last_db >= BB_DB_MIN_SECONDS
-                        if should_write:
-                            if await _is_cancelled(job_id):
-                                await kill.kill_remote_process(job_id)
-                                return
-                            if phase_changed:
-                                await _record_bb_phase_start(
-                                    job_id, "surface_validate",
-                                    progress.phase, _now(),
-                                )
-                            await _update_stage_percent(
-                                job_id, "surface_validate", progress.overall_pct,
-                            )
-                            await _update_stage_bb_phase(
-                                job_id, "surface_validate",
-                                progress.phase, progress.phase_pct,
-                            )
-                            await _update_stage_bad_blocks(
-                                job_id, "surface_validate", bad_blocks_total,
-                            )
-                            if (
-                                drive_size_bytes
-                                and not phase_changed
-                                and progress.overall_pct > last_pct_sample
-                                and time_since_last_db >= 1.0
-                            ):
-                                d_pct = progress.overall_pct - last_pct_sample
-                                bytes_done = (d_pct / 800.0) * drive_size_bytes
-                                mbps = bytes_done / time_since_last_db / 1_000_000
-                                await _update_stage_bb_mbps(
-                                    job_id, "surface_validate", mbps,
-                                )
-                            if pending_log_chunks:
-                                chunk = "".join(pending_log_chunks)
-                                pending_log_chunks.clear()
-                                await _append_stage_log(
-                                    job_id, "surface_validate", chunk,
-                                )
-                            last_pct_sample = progress.overall_pct
-                            last_db_write_ts = now_ts
-                            await _recalculate_progress(job_id)
-                            _push_update()
-                        if not is_progress_line:
-                            pending_log_chunks.append(line)
-                        # Abort on bad block threshold — immediate.
+                        output_lines.append(line)
+                        if is_stderr:
+                            m = _re.search(r"([\d.]+)%\s+done", line)
+                            if m:
+                                pct = min(99, int(float(m.group(1))))
+                                await _update_stage_percent(job_id, "surface_validate", pct)
+                                await _update_stage_bad_blocks(job_id, "surface_validate", bad_blocks_total)
+                                await _recalculate_progress(job_id)
+                                _push_update()
+                        else:
+                            stripped = line.strip()
+                            if stripped and stripped.isdigit():
+                                bad_blocks_total += 1
+                        # Append to DB log in chunks
+                        if len(output_lines) % 20 == 0:
+                            chunk = "".join(output_lines[-20:])
+                            await _append_stage_log(job_id, "surface_validate", chunk)
+                        # Abort on bad block threshold
                         if bad_blocks_total > settings.bad_block_threshold:
                             await kill.kill_remote_process(job_id)
                             output_lines.append(
@@ -668,9 +477,15 @@ async def _stage_surface_validate_ssh(job_id: int, devname: str, drive_id: int)
                             )
                             return
-                # Single stream now — the `2>&1` in _build_badblocks_cmd
-                # merges stderr into stdout before the `tr` pipe.
-                await _drain(proc.stdout, False)
+                        if await _is_cancelled(job_id):
+                            await kill.kill_remote_process(job_id)
+                            return
+                await asyncio.gather(
+                    _drain(proc.stdout, False),
+                    _drain(proc.stderr, True),
+                    return_exceptions=True,
+                )
                 # Bound proc.wait so a remote process that ignored our kill
                 # signal (or that we never managed to kill) can't pin this
                 # task in the semaphore forever. Closing the connection on
@@ -695,21 +510,7 @@ async def _stage_surface_validate_ssh(job_id: int, devname: str, drive_id: int)
         result["aborted"] = bad_blocks_total > settings.bad_block_threshold
     except asyncio.CancelledError:
-        # Best-effort kill of the remote badblocks process before
-        # propagating the cancel. asyncio.shield() so the kill attempt
-        # itself isn't interrupted by ongoing loop shutdown. Then
-        # re-raise so _run_job marks the job 'unknown' (honest about
-        # the indeterminate outcome) instead of 'failed' (which
-        # implies the burn-in itself failed, which we don't know).
-        try:
-            await asyncio.shield(kill.kill_remote_process(job_id))
-        except Exception:
-            pass
-        await _append_stage_log(
-            job_id, "surface_validate",
-            "\n[ABORTED] task cancelled (likely container restart or shutdown)\n",
-        )
-        raise
+        return False
     except Exception as exc:
         await _append_stage_log(job_id, "surface_validate", f"\n[SSH error] {exc}\n")
         await _set_stage_error(job_id, "surface_validate", f"SSH badblocks error: {exc}")
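
The removed docstring explains that badblocks overwrites its progress text with backspace characters, so a line-based reader sees nothing until a newline arrives; the old code handled this with `tr '\b' '\n'` in the remote shell. The sketch below does the equivalent split in Python, purely to illustrate the idea. `progress_values` is a hypothetical helper and the sample string is illustrative, not captured output. The regex is the one used on both sides of this diff.

```python
import re

PERCENT_RE = re.compile(r"([\d.]+)%\s+done")


def progress_values(raw: str) -> list[float]:
    """Split backspace-separated progress text and extract each percent value."""
    values = []
    # Treat every backspace as a line break, as `tr '\b' '\n'` would.
    for piece in raw.replace("\b", "\n").splitlines():
        m = PERCENT_RE.search(piece)
        if m:
            values.append(float(m.group(1)))
    return values


sample = "12.50% done, 1:23 elapsed" + "\b" * 25 + "12.51% done, 1:24 elapsed\n"
values = progress_values(sample)
```

Runs of consecutive backspaces become empty pieces, which simply fail the regex and are skipped.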


@@ -49,10 +49,7 @@ class Settings(BaseSettings):
     webhook_url: str = ""
     # Stuck-job detection: jobs running longer than this are marked 'unknown'
-    # and the remote badblocks/smartctl is killed. 168h (7 days) covers a
-    # full -w surface_validate on a 14 TB+ HDD with margin. Older default
-    # was 24h which false-positived on multi-TB drives almost every time.
-    stuck_job_hours: int = 168
+    stuck_job_hours: int = 24
     # Temperature thresholds (°C) — drives table colouring + precheck gate
     temp_warn_c: int = 46  # orange warning
@@ -86,7 +83,7 @@ class Settings(BaseSettings):
     ssh_key: str = ""  # PEM private key content (paste full key including headers)
     # Application version — used by the /api/v1/updates/check endpoint
-    app_version: str = "1.0.0-60"
+    app_version: str = "1.0.0-41"
     # ---- Authentication (1.0.0-22) ----
     # session_secret: HMAC key for signing session cookies. Empty = generate


@@ -93,24 +93,6 @@ _MIGRATIONS = [
     "ALTER TABLE drives ADD COLUMN pool_name TEXT",
     "ALTER TABLE drives ADD COLUMN pool_role TEXT",
     "ALTER TABLE drives ADD COLUMN pool_seen_at TEXT",
-    # 1.0.0-44: per-pattern badblocks progress for the drive drawer's
-    # 4-meter UI. bb_phase is 1-8 (1=write 0xaa, 2=verify 0xaa, 3=write
-    # 0x55, 4=verify 0x55, 5=write 0xff, 6=verify 0xff, 7=write 0x00,
-    # 8=verify 0x00). bb_phase_pct is 0-100 within the current phase.
-    "ALTER TABLE burnin_stages ADD COLUMN bb_phase INTEGER",
-    "ALTER TABLE burnin_stages ADD COLUMN bb_phase_pct REAL",
-    # 1.0.0-46: live write/read throughput for the per-pattern meters.
-    # Computed from successive `XX% done` lines in badblocks output:
-    # delta_bytes = (overall_pct_delta / 800) * drive_size_bytes.
-    # Updated on every progress line; NULL until the second progress
-    # line arrives (need two samples to compute a rate).
-    "ALTER TABLE burnin_stages ADD COLUMN bb_mbps REAL",
-    # 1.0.0-47: per-pattern duration history. JSON map of
-    # {"1": "2026-05-09T05:39:44+00:00", "2": ..., ...} where each key
-    # is the phase number (1-8) and the value is when the parser first
-    # observed that phase. Drawer derives "0xaa: 14h 22m" by diffing
-    # consecutive phase-1 keys.
-    "ALTER TABLE burnin_stages ADD COLUMN bb_phase_history TEXT",
     # 1.0.0-19: enforce one active burn-in per drive at the storage layer.
     # Closes the read-then-insert race in burnin.start_job — without this,
     # two concurrent /api/v1/burnin/start requests for the same drive could
@ -176,7 +158,6 @@ async def init_db() -> None:
Path(settings.db_path).parent.mkdir(parents=True, exist_ok=True) Path(settings.db_path).parent.mkdir(parents=True, exist_ok=True)
async with aiosqlite.connect(settings.db_path) as db: async with aiosqlite.connect(settings.db_path) as db:
await db.execute("PRAGMA journal_mode=WAL") await db.execute("PRAGMA journal_mode=WAL")
await db.execute("PRAGMA busy_timeout=60000")
await db.execute("PRAGMA foreign_keys=ON") await db.execute("PRAGMA foreign_keys=ON")
await db.executescript(SCHEMA) await db.executescript(SCHEMA)
await _run_migrations(db) await _run_migrations(db)
@ -188,7 +169,6 @@ async def get_db():
db.row_factory = aiosqlite.Row db.row_factory = aiosqlite.Row
try: try:
await db.execute("PRAGMA journal_mode=WAL") await db.execute("PRAGMA journal_mode=WAL")
await db.execute("PRAGMA busy_timeout=60000")
await db.execute("PRAGMA foreign_keys=ON") await db.execute("PRAGMA foreign_keys=ON")
yield db yield db
finally: finally:
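
The `busy_timeout` pragma in the hunks above tells SQLite to wait for a lock rather than fail immediately with `database is locked`. A minimal sketch of the same three-pragma connection setup follows. It assumes the stdlib `sqlite3` module as a stand-in for `aiosqlite`; the helper name `open_db` and the demo file path are hypothetical.

```python
import os
import sqlite3
import tempfile


def open_db(path: str, busy_timeout_ms: int = 60000) -> sqlite3.Connection:
    """Open a connection with the same three PRAGMAs the diff applies."""
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row
    # WAL lets readers proceed while a single writer holds the write lock.
    conn.execute("PRAGMA journal_mode=WAL")
    # Wait up to busy_timeout_ms for a lock instead of failing right away
    # with "database is locked".
    conn.execute(f"PRAGMA busy_timeout={int(busy_timeout_ms)}")
    conn.execute("PRAGMA foreign_keys=ON")
    return conn


# Demo against a throwaway file (WAL needs a real file, not :memory:).
_tmpdir = tempfile.mkdtemp()
conn = open_db(os.path.join(_tmpdir, "demo.db"))
journal_mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
busy_timeout = conn.execute("PRAGMA busy_timeout").fetchone()[0]
foreign_keys = conn.execute("PRAGMA foreign_keys").fetchone()[0]
conn.close()
```

Centralising the pragmas in one helper is only a suggestion; the diff itself repeats them at each `aiosqlite.connect` call site.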


@ -334,7 +334,6 @@ async def _fetch_report_data() -> list[dict]:
async with aiosqlite.connect(settings.db_path) as db: async with aiosqlite.connect(settings.db_path) as db:
db.row_factory = aiosqlite.Row db.row_factory = aiosqlite.Row
await db.execute("PRAGMA journal_mode=WAL") await db.execute("PRAGMA journal_mode=WAL")
await db.execute("PRAGMA busy_timeout=60000")
return await _fetch_drives_for_template(db) return await _fetch_drives_for_template(db)
@ -348,7 +347,6 @@ async def _fetch_unlock_events_24h() -> list[dict]:
async with aiosqlite.connect(settings.db_path) as db: async with aiosqlite.connect(settings.db_path) as db:
db.row_factory = aiosqlite.Row db.row_factory = aiosqlite.Row
await db.execute("PRAGMA journal_mode=WAL") await db.execute("PRAGMA journal_mode=WAL")
await db.execute("PRAGMA busy_timeout=60000")
# julianday() handles the 'YYYY-MM-DDTHH:MM:SS.fff+00:00' format # julianday() handles the 'YYYY-MM-DDTHH:MM:SS.fff+00:00' format
# we write from Python; comparing the raw string against # we write from Python; comparing the raw string against
# datetime('now','-1 day') (which formats as 'YYYY-MM-DD HH:MM:SS') # datetime('now','-1 day') (which formats as 'YYYY-MM-DD HH:MM:SS')
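
The comment above is cut off by the hunk boundary, but the pitfall it names can be reproduced in isolation. The sketch below assumes stdlib `sqlite3` and two made-up timestamps; it shows why a raw string comparison between the two formats is unreliable while `julianday()` compares chronologically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical values: an event stored in the Python ISO format and a cutoff
# in the 'YYYY-MM-DD HH:MM:SS' shape that datetime() produces.
event_ts = "2026-05-09T05:00:00.000+00:00"  # 05:00 UTC
cutoff = "2026-05-09 12:00:00"              # 12:00 UTC, i.e. later

# Raw string comparison: at index 10, 'T' sorts after ' ', so the earlier
# event wrongly compares as newer than the cutoff.
string_says_newer = conn.execute(
    "SELECT ? > ?", (event_ts, cutoff)
).fetchone()[0]

# julianday() parses both shapes into a number, so the comparison is
# chronological.
julian_says_newer = conn.execute(
    "SELECT julianday(?) > julianday(?)", (event_ts, cutoff)
).fetchone()[0]
conn.close()
```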


@ -189,21 +189,6 @@ class _AuthGateMiddleware(BaseHTTPMiddleware):
await auth.get_user_by_id(int(user_id)) if user_id else None await auth.get_user_by_id(int(user_id)) if user_id else None
) )
# Loopback bypass (1.0.0-56): requests from 127.0.0.1 / ::1
# inside the container skip the auth gate. The only way to hit
# that source IP is a process in the container's network
# namespace — `docker exec` from the host. External traffic
# comes through the docker bridge with a non-loopback source,
# so it still goes through full auth. We read request.client.host
# directly (raw TCP socket), NOT X-Forwarded-For, so external
# attackers can't spoof loopback via headers. This unlocks the
# autonomous monitor's ability to POST /api/v1/burnin/start
# without provisioning a session cookie.
if request.client and request.client.host in ("127.0.0.1", "::1"):
if request.state.current_user is None:
request.state.current_user = auth.LoopbackUser()
return await call_next(request)
if path in _PUBLIC_PATHS or path.startswith(_PUBLIC_PREFIXES): if path in _PUBLIC_PATHS or path.startswith(_PUBLIC_PREFIXES):
return await call_next(request) return await call_next(request)
if request.state.current_user is not None: if request.state.current_user is not None:
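
The loopback bypass in this hunk hinges on one rule: trust the TCP peer address, never a forwarded header. A framework-free sketch of that decision, with a hypothetical helper name and signature, might look like this:

```python
from typing import Mapping, Optional

LOOPBACK_HOSTS = ("127.0.0.1", "::1")


def is_loopback_peer(peer_host: Optional[str], headers: Mapping[str, str]) -> bool:
    """True only when the TCP peer address itself is loopback.

    `headers` is accepted but deliberately ignored: X-Forwarded-For is
    client-controlled, so it must not influence the decision.
    """
    return peer_host in LOOPBACK_HOSTS


local = is_loopback_peer("127.0.0.1", {})
local_v6 = is_loopback_peer("::1", {})
# A bridge-network client claiming to be loopback via a header is rejected.
spoofed = is_loopback_peer("172.17.0.1", {"x-forwarded-for": "127.0.0.1"})
no_peer = is_loopback_peer(None, {})
```

In the middleware itself the peer address comes from `request.client.host`; the `172.17.0.1` address here is only an illustrative Docker bridge source.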


@ -437,7 +437,6 @@ async def poll_cycle(client: TrueNASClient) -> int:
async with aiosqlite.connect(settings.db_path) as db: async with aiosqlite.connect(settings.db_path) as db:
db.row_factory = aiosqlite.Row db.row_factory = aiosqlite.Row
await db.execute("PRAGMA journal_mode=WAL") await db.execute("PRAGMA journal_mode=WAL")
await db.execute("PRAGMA busy_timeout=60000")
await db.execute("PRAGMA foreign_keys=ON") await db.execute("PRAGMA foreign_keys=ON")
for disk in disks: for disk in disks:
@ -493,7 +492,6 @@ async def run(client: TrueNASClient) -> None:
async with aiosqlite.connect(settings.db_path) as _tdb: async with aiosqlite.connect(settings.db_path) as _tdb:
_tdb.row_factory = aiosqlite.Row _tdb.row_factory = aiosqlite.Row
await _tdb.execute("PRAGMA journal_mode=WAL") await _tdb.execute("PRAGMA journal_mode=WAL")
await _tdb.execute("PRAGMA busy_timeout=60000")
_cur = await _tdb.execute(""" _cur = await _tdb.execute("""
SELECT MAX(d.temperature_c) SELECT MAX(d.temperature_c)
FROM drives d FROM drives d


@ -128,7 +128,6 @@ async def sse_drives(request: Request):
async with aiosqlite.connect(settings.db_path) as db: async with aiosqlite.connect(settings.db_path) as db:
db.row_factory = aiosqlite.Row db.row_factory = aiosqlite.Row
await db.execute("PRAGMA journal_mode=WAL") await db.execute("PRAGMA journal_mode=WAL")
await db.execute("PRAGMA busy_timeout=60000")
drives = await _fetch_drives_for_template(db) drives = await _fetch_drives_for_template(db)
html = templates.env.get_template( html = templates.env.get_template(


@ -147,12 +147,11 @@ async def _fetch_drives_for_template(db: aiosqlite.Connection) -> list[dict]:
# For burn-ins that include SMART stages, fetch those stages so we can # For burn-ins that include SMART stages, fetch those stages so we can
# mirror their progress/result in the Short/Long SMART columns. # mirror their progress/result in the Short/Long SMART columns.
# We include burn-ins in ANY state — including failed/passed/cancelled —
# so the SMART columns don't go blank when the burn-in finishes. Without
# this, "FAILED (LONG SMART)" appears in the Burn-In column while the
# Long SMART column shows "—", which contradicts itself.
bi_smart_stages: dict[int, dict[str, dict]] = {} # job_id -> {stage_name: row} bi_smart_stages: dict[int, dict[str, dict]] = {} # job_id -> {stage_name: row}
bi_ids_with_smart = [bi["id"] for bi in burnin_by_drive.values()] bi_ids_with_smart = [
bi["id"] for bi in burnin_by_drive.values()
if bi["state"] in ("running", "queued")
]
if bi_ids_with_smart: if bi_ids_with_smart:
placeholders = ",".join("?" * len(bi_ids_with_smart)) placeholders = ",".join("?" * len(bi_ids_with_smart))
# placeholders is purely structural ("?,?,?"); IDs themselves are # placeholders is purely structural ("?,?,?"); IDs themselves are
@ -164,7 +163,7 @@ async def _fetch_drives_for_template(db: aiosqlite.Connection) -> list[dict]:
"FROM burnin_stages bs " "FROM burnin_stages bs "
"WHERE bs.burnin_job_id IN (" + placeholders + ") " "WHERE bs.burnin_job_id IN (" + placeholders + ") "
" AND bs.stage_name IN ('short_smart', 'long_smart') " " AND bs.stage_name IN ('short_smart', 'long_smart') "
" AND bs.state IN ('running', 'passed', 'failed', 'aborted')" " AND bs.state IN ('running', 'passed', 'failed')"
) )
cur = await db.execute(sql, bi_ids_with_smart) cur = await db.execute(sql, bi_ids_with_smart)
for r in await cur.fetchall(): for r in await cur.fetchall():
@ -186,26 +185,14 @@ async def _fetch_drives_for_template(db: aiosqlite.Connection) -> list[dict]:
if existing.get("state") not in (None, "idle"): if existing.get("state") not in (None, "idle"):
continue continue
pct = stage["percent"] or 0 pct = stage["percent"] or 0
stage_state = stage["state"]
# If the parent burn-in ended in failure but this SMART
# stage is still recorded as "running", that's an
# orphaned stage row from a hard crash (e.g. the old
# `database is locked` failure mode). Surface as failed
# so the column matches the Burn-In column.
if stage_state == "running" and bi.get("state") in (
"failed", "cancelled", "unknown"
):
stage_state = bi["state"] if bi["state"] != "unknown" else "failed"
d[target] = { d[target] = {
"state": stage_state, "state": stage["state"],
"percent": pct if stage_state == "running" else (100 if stage_state == "passed" else 0), "percent": pct if stage["state"] == "running" else (100 if stage["state"] == "passed" else 0),
"eta_seconds": _compute_eta_seconds(stage["started_at"], pct) if stage_state == "running" else None, "eta_seconds": _compute_eta_seconds(stage["started_at"], pct) if stage["state"] == "running" else None,
"eta_timestamp": None, "eta_timestamp": None,
"started_at": stage["started_at"], "started_at": stage["started_at"],
"finished_at": stage["finished_at"], "finished_at": stage["finished_at"],
"error_text": stage["error_text"] or ( "error_text": stage["error_text"],
bi.get("error_text") if stage_state == "failed" else None
),
} }
drives.append(d) drives.append(d)
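
The orphaned-stage handling on the left side of this hunk can be read as two small pure functions. The sketch below restates that logic outside the query code; the function names are hypothetical and the rules are taken from the hunk as shown.

```python
from typing import Optional

_ENDED_BADLY = ("failed", "cancelled", "unknown")


def effective_stage_state(stage_state: str, job_state: Optional[str]) -> str:
    """Resolve a SMART stage's display state against its parent burn-in.

    A stage still marked "running" under a job that has already ended badly
    is treated as an orphan; an "unknown" job maps to "failed".
    """
    if stage_state == "running" and job_state in _ENDED_BADLY:
        return "failed" if job_state == "unknown" else job_state
    return stage_state


def display_percent(stage_state: str, percent: Optional[float]) -> float:
    """Live percent while running, 100 once passed, 0 otherwise."""
    if stage_state == "running":
        return percent or 0
    return 100 if stage_state == "passed" else 0


orphan = effective_stage_state("running", "unknown")
cancelled = effective_stage_state("running", "cancelled")
healthy = effective_stage_state("running", "running")
orphan_pct = display_percent(orphan, 37.5)
```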


@ -57,26 +57,11 @@ async def drive_drawer(drive_id: int, db: aiosqlite.Connection = Depends(get_db)
job = dict(job_row) job = dict(job_row)
cur = await db.execute( cur = await db.execute(
"SELECT id, stage_name, state, percent, started_at, finished_at, " "SELECT id, stage_name, state, percent, started_at, finished_at, "
"duration_seconds, error_text, log_text, bad_blocks, " "duration_seconds, error_text, log_text, bad_blocks "
"bb_phase, bb_phase_pct, bb_mbps, bb_phase_history "
"FROM burnin_stages WHERE burnin_job_id=? ORDER BY id", "FROM burnin_stages WHERE burnin_job_id=? ORDER BY id",
(job_row["id"],), (job_row["id"],),
) )
stages = [dict(r) for r in await cur.fetchall()] job["stages"] = [dict(r) for r in await cur.fetchall()]
# Backfill empty stage.error_text from the parent job's error_text
# for any stage that ended in a terminal state without recording
# an error of its own. This catches the orphan pattern from hard
# crashes (DB-locked, SSH disconnect, container restart) where
# the failure didn't get to write a per-stage explanation.
job_err = job.get("error_text")
for s in stages:
if (
s.get("state") in ("failed", "cancelled", "unknown")
and not s.get("error_text")
and job_err
):
s["error_text"] = job_err
job["stages"] = stages
burnin_job = job burnin_job = job
# SMART raw output from smart_tests table # SMART raw output from smart_tests table
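
The `error_text` backfill on the left side of this hunk is easy to check in isolation. The following sketch restates it as a function that copies rather than mutates; the function name and the sample rows are illustrative only.

```python
from typing import Dict, List, Optional

TERMINAL_WITHOUT_SUCCESS = ("failed", "cancelled", "unknown")


def backfill_stage_errors(stages: List[Dict], job_error: Optional[str]) -> List[Dict]:
    """Return copies of `stages` with empty error_text filled from the job."""
    out = []
    for stage in stages:
        s = dict(stage)  # copy so the caller's rows are not mutated
        if (
            s.get("state") in TERMINAL_WITHOUT_SUCCESS
            and not s.get("error_text")
            and job_error
        ):
            s["error_text"] = job_error
        out.append(s)
    return out


stages = [
    {"stage_name": "short_smart", "state": "passed", "error_text": None},
    {"stage_name": "surface_validate", "state": "failed", "error_text": None},
    {"stage_name": "long_smart", "state": "failed", "error_text": "SMART failure"},
]
filled = backfill_stage_errors(stages, "database is locked")
```

Only the second row inherits the job-level message: the first stage passed, and the third already recorded its own reason.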
@ -121,7 +106,6 @@ async def drive_drawer(drive_id: int, db: aiosqlite.Connection = Depends(get_db)
"serial": drive.serial, "serial": drive.serial,
"model": drive.model, "model": drive.model,
"size_bytes": drive.size_bytes, "size_bytes": drive.size_bytes,
"temperature_c": drive.temperature_c,
}, },
"burnin": burnin_job, "burnin": burnin_job,
"smart": { "smart": {


@ -244,7 +244,7 @@ thead {
} }
th { th {
padding: 6px 8px; padding: 9px 14px;
font-size: 11px; font-size: 11px;
font-weight: 600; font-weight: 600;
text-transform: uppercase; text-transform: uppercase;
@ -256,10 +256,9 @@ th {
} }
td { td {
padding: 7px 8px; padding: 10px 14px;
border-bottom: 1px solid var(--border); border-bottom: 1px solid var(--border);
vertical-align: middle; vertical-align: middle;
line-height: 1.3;
} }
tr:last-child td { tr:last-child td {
@ -277,15 +276,17 @@ tr:hover td {
/* ----------------------------------------------------------------------- /* -----------------------------------------------------------------------
Column widths Column widths
----------------------------------------------------------------------- */ ----------------------------------------------------------------------- */
.col-drive { min-width: 160px; } .col-drive { min-width: 180px; }
.col-serial { min-width: 95px; } .col-serial { min-width: 110px; }
.col-size { min-width: 60px; text-align: right; } .col-size { min-width: 70px; text-align: right; }
.col-temp { min-width: 60px; text-align: right; } .col-temp { min-width: 75px; text-align: right; }
.col-health { min-width: 70px; } .col-health { min-width: 85px; }
.col-smart { min-width: 80px; } .col-smart { min-width: 95px; }
/* Tighter SMART columns — they hold short pills or a progress bar. */ /* Tighter horizontal padding on the SMART columns; they hold short
th.col-smart, td.col-smart { padding-left: 5px; padding-right: 5px; } pills ("Passed"/"—") or a progress bar, so the default 14px gutter
.col-actions { min-width: 150px; } wastes space on 13" laptops. */
th.col-smart, td.col-smart { padding-left: 6px; padding-right: 6px; }
.col-actions { min-width: 170px; }
/* ----------------------------------------------------------------------- /* -----------------------------------------------------------------------
Drive cell Drive cell
@ -294,23 +295,14 @@ th.col-smart, td.col-smart { padding-left: 5px; padding-right: 5px; }
display: block; display: block;
font-weight: 500; font-weight: 500;
color: var(--text-strong); color: var(--text-strong);
font-size: 13px; font-size: 14px;
line-height: 1.25;
} }
.drive-model { .drive-model {
display: inline; display: block;
font-size: 10px; font-size: 11px;
color: var(--text-muted); color: var(--text-muted);
margin-top: 0; margin-top: 1px;
line-height: 1.25;
}
/* Separator between model and location when both are present on the
same line. ::before on the .drive-location that follows .drive-model
puts a thin dot between them. */
.drive-model + .drive-location::before {
content: " · ";
color: var(--border);
margin: 0 2px;
} }
/* ----------------------------------------------------------------------- /* -----------------------------------------------------------------------
@ -433,7 +425,7 @@ th.col-smart, td.col-smart { padding-left: 5px; padding-right: 5px; }
/* ----------------------------------------------------------------------- /* -----------------------------------------------------------------------
Burn-in column Burn-in column
----------------------------------------------------------------------- */ ----------------------------------------------------------------------- */
.col-burnin { min-width: 130px; } .col-burnin { min-width: 160px; }
.burnin-cell { min-width: 140px; } .burnin-cell { min-width: 140px; }
@ -1188,9 +1180,9 @@ a.stat-card:hover {
Checkbox column Checkbox column
----------------------------------------------------------------------- */ ----------------------------------------------------------------------- */
.col-check { .col-check {
width: 32px; width: 36px;
min-width: 32px; min-width: 36px;
padding: 7px 4px 7px 8px; padding: 10px 8px 10px 14px;
} }
.drive-checkbox, #select-all-cb { .drive-checkbox, #select-all-cb {
@ -1204,15 +1196,18 @@ a.stat-card:hover {
Drive location inline edit Drive location inline edit
----------------------------------------------------------------------- */ ----------------------------------------------------------------------- */
.drive-location { .drive-location {
display: inline; display: block;
font-size: 10px; font-size: 10px;
color: var(--text-muted); color: var(--text-muted);
margin-top: 0; margin-top: 2px;
cursor: pointer; cursor: pointer;
border-radius: 3px; border-radius: 3px;
padding: 0 3px; padding: 1px 3px;
line-height: 1.1;
transition: background 0.1s; transition: background 0.1s;
max-width: 160px;
overflow: hidden;
text-overflow: ellipsis;
white-space: nowrap;
} }
.drive-location:hover { background: var(--border); color: var(--text); } .drive-location:hover { background: var(--border); color: var(--text); }
@ -2699,276 +2694,3 @@ tr.drawer-row-active {
font-variant-numeric: tabular-nums; font-variant-numeric: tabular-nums;
} }
/* -----------------------------------------------------------------------
Per-pattern badblocks meters in the drive drawer (1.0.0-44).
Four meters, one per pattern (0xaa / 0x55 / 0xff / 0x00). Each meter
has two halves: write (left) and verify (right), so a glance shows
both which pattern is running and which sub-phase within it.
----------------------------------------------------------------------- */
.bb-meters {
display: grid;
grid-template-columns: repeat(4, 1fr);
gap: 8px;
padding: 10px 12px;
background: var(--bg-soft, #161b22);
border-radius: 6px;
margin: 6px 0 8px 0;
}
.bb-meter {
display: flex;
flex-direction: column;
gap: 4px;
}
.bb-meter-label {
font-family: "SF Mono", "Consolas", monospace;
font-size: 10px;
color: var(--text-muted);
text-transform: uppercase;
letter-spacing: .04em;
}
.bb-meter-current .bb-meter-label {
color: var(--blue, #58a6ff);
font-weight: 600;
}
.bb-meter-done .bb-meter-label {
color: var(--green, #3fb950);
}
.bb-meter-bar {
display: flex;
height: 10px;
background: var(--bg, #0d1117);
border: 1px solid var(--border, #30363d);
border-radius: 3px;
overflow: hidden;
position: relative;
}
.bb-meter-half {
height: 100%;
transition: width .3s ease;
}
.bb-write {
background: var(--blue, #58a6ff);
flex: 0 0 auto;
max-width: 50%;
}
.bb-verify {
background: var(--green, #3fb950);
flex: 0 0 auto;
max-width: 50%;
}
.bb-meter-half-spacer {
flex: 0 0 auto;
width: 1px;
background: var(--border, #30363d);
height: 100%;
}
.bb-meter-done .bb-write,
.bb-meter-done .bb-verify {
opacity: .55;
}
.bb-meter-sub {
display: flex;
justify-content: space-between;
font-family: "SF Mono", "Consolas", monospace;
font-size: 9px;
color: var(--text-muted);
}
.bb-sub-write { color: color-mix(in srgb, var(--blue) 80%, var(--text-muted)); }
.bb-sub-verify { color: color-mix(in srgb, var(--green) 80%, var(--text-muted)); }
/* -----------------------------------------------------------------------
Surface-scan vital-signs row in the drawer (1.0.0-46).
Sits directly above the per-pattern meters. Temperature with
green/yellow/red colour, live MB/s, elapsed, ETA, all derived
from data already in the drawer payload.
----------------------------------------------------------------------- */
.bb-vitals {
display: flex;
gap: 14px;
flex-wrap: wrap;
padding: 8px 12px 4px 12px;
background: var(--bg-soft, #161b22);
border-radius: 6px 6px 0 0;
margin: 6px 0 0 0;
border-bottom: 1px solid var(--border, #30363d);
}
/* When vitals lead, suppress the meter strip's top radius + margin so
they read as one stacked unit. */
.bb-vitals + .bb-meters {
border-radius: 0 0 6px 6px;
margin-top: 0;
}
.bb-vital {
display: flex;
flex-direction: column;
gap: 1px;
font-family: "SF Mono", "Consolas", monospace;
}
.bb-vital-label {
font-size: 9px;
color: var(--text-muted);
text-transform: uppercase;
letter-spacing: .04em;
}
.bb-vital-value {
font-size: 13px;
color: var(--text-strong, #f0f6fc);
font-weight: 500;
font-variant-numeric: tabular-nums;
}
/* -----------------------------------------------------------------------
Phase caption + per-pattern history (1.0.0-47).
----------------------------------------------------------------------- */
.bb-caption {
font-family: "SF Mono", "Consolas", monospace;
font-size: 11px;
color: var(--text-muted);
padding: 6px 12px 0 12px;
letter-spacing: .02em;
}
.bb-history {
display: flex;
flex-wrap: wrap;
align-items: center;
gap: 10px;
padding: 6px 12px 8px 12px;
font-family: "SF Mono", "Consolas", monospace;
font-size: 10px;
color: var(--text-muted);
}
.bb-hist-title {
text-transform: uppercase;
letter-spacing: .04em;
font-size: 9px;
margin-right: 4px;
}
.bb-hist-row {
display: inline-flex;
align-items: baseline;
gap: 4px;
background: var(--bg, #0d1117);
border: 1px solid var(--border, #30363d);
border-radius: 3px;
padding: 1px 6px;
}
.bb-hist-label {
color: var(--green, #3fb950);
font-weight: 600;
}
.bb-hist-dur {
color: var(--text-strong, #f0f6fc);
font-variant-numeric: tabular-nums;
}
/* Bad-block counter colour states inside the vitals row */
.bb-vital-good { color: var(--green, #3fb950); }
.bb-vital-bad { color: var(--red, #f85149); }
/* -----------------------------------------------------------------------
Column sort (1.0.0-48). Click a sortable TH to cycle asc → desc →
cleared. Indicator arrow appears next to the column label.
----------------------------------------------------------------------- */
th.sortable {
cursor: pointer;
user-select: none;
position: relative;
}
th.sortable:hover { color: var(--text); }
th.sortable::after {
content: "";
display: inline-block;
width: 0;
height: 0;
margin-left: 4px;
border-left: 4px solid transparent;
border-right: 4px solid transparent;
vertical-align: middle;
opacity: 0;
}
th.sortable:hover::after { opacity: 0.4; border-bottom: 5px solid currentColor; }
th.sort-asc::after {
opacity: 1;
border-bottom: 5px solid var(--blue, #58a6ff);
}
th.sort-desc::after {
opacity: 1;
border-top: 5px solid var(--blue, #58a6ff);
}
/* -----------------------------------------------------------------------
Stage "Reason" block: explains why a stage ended in a terminal
state. Replaces the old single-line stage-error-line for
failed/cancelled/unknown stages so the operator gets a clear,
prominent explanation at the top.
----------------------------------------------------------------------- */
.stage-reason {
display: flex;
gap: 10px;
align-items: baseline;
padding: 8px 12px;
margin: 6px 0;
border-radius: 5px;
font-size: 12px;
border: 1px solid;
}
.stage-reason-failed {
background: var(--red-bg, color-mix(in srgb, var(--red) 12%, transparent));
border-color: var(--red-bd, color-mix(in srgb, var(--red) 40%, transparent));
}
.stage-reason-cancelled,
.stage-reason-unknown {
background: var(--yellow-bg, color-mix(in srgb, var(--yellow) 12%, transparent));
border-color: var(--yellow-bd, color-mix(in srgb, var(--yellow) 40%, transparent));
}
.stage-reason-label {
font-size: 10px;
text-transform: uppercase;
letter-spacing: .06em;
font-weight: 600;
color: var(--text-muted);
flex-shrink: 0;
}
.stage-reason-text {
flex: 1;
color: var(--text-strong, #f0f6fc);
line-height: 1.4;
word-wrap: break-word;
}
.stage-reason-failed .stage-reason-text { color: var(--red, #f85149); }
.stage-reason-cancelled .stage-reason-text,
.stage-reason-unknown .stage-reason-text { color: var(--yellow, #d29922); }
/* -----------------------------------------------------------------------
Drawer job-level estimated completion (right-aligned in the header,
so it doesn't compete with the state chip + operator info).
----------------------------------------------------------------------- */
.drawer-job-header {
display: flex;
align-items: center;
gap: 10px;
flex-wrap: wrap;
}
.drawer-job-finish {
display: inline-flex;
align-items: baseline;
gap: 8px;
padding: 4px 10px;
background: var(--bg-soft, #161b22);
border: 1px solid var(--border, #30363d);
border-radius: 5px;
font-family: "SF Mono", "Consolas", monospace;
}
.drawer-job-finish-label {
font-size: 9px;
color: var(--text-muted);
text-transform: uppercase;
letter-spacing: .04em;
}
.drawer-job-finish-value {
font-size: 12px;
color: var(--text-strong, #f0f6fc);
font-weight: 500;
font-variant-numeric: tabular-nums;
}


@ -79,86 +79,12 @@
initElapsedTimers(); initElapsedTimers();
initUnlockCountdowns(); initUnlockCountdowns();
initLocationEdits(); initLocationEdits();
applySort(); // SSE swap replaces #drives-tbody — re-apply persisted sort
paintSortIndicators();
if (_drawerDriveId) { if (_drawerDriveId) {
_drawerHighlightRow(_drawerDriveId); _drawerHighlightRow(_drawerDriveId);
drawerFetch(_drawerDriveId); drawerFetch(_drawerDriveId);
} }
}); });
// ---------------------------------------------------------------
// Column sorting (client-side, persisted in localStorage so it
// survives reload AND survives every SSE-driven tbody refresh).
// ---------------------------------------------------------------
var SORT_KEY = 'nasburnin.sort';
function getSort() {
try {
var raw = localStorage.getItem(SORT_KEY);
if (!raw) return null;
var p = JSON.parse(raw);
if (p && p.col && (p.dir === 'asc' || p.dir === 'desc')) return p;
} catch (e) {}
return null;
}
function setSort(col, dir) {
if (!col) localStorage.removeItem(SORT_KEY);
else localStorage.setItem(SORT_KEY, JSON.stringify({col: col, dir: dir}));
}
function applySort() {
var s = getSort();
var tbody = document.getElementById('drives-tbody');
if (!tbody || !s) return;
var rows = Array.from(tbody.querySelectorAll('tr[id^="drive-"]'));
if (!rows.length) return;
var attr = 'data-sort-' + s.col;
var dirMul = s.dir === 'asc' ? 1 : -1;
rows.sort(function (a, b) {
var av = a.getAttribute(attr);
var bv = b.getAttribute(attr);
// Empty values always sink to the bottom regardless of direction.
var aEmpty = av === null || av === '';
var bEmpty = bv === null || bv === '';
if (aEmpty && !bEmpty) return 1;
if (!aEmpty && bEmpty) return -1;
if (aEmpty && bEmpty) return 0;
// Numeric comparison if both parse cleanly, else string.
var an = parseFloat(av), bn = parseFloat(bv);
if (!isNaN(an) && !isNaN(bn) && String(an) === av && String(bn) === bv) {
return (an - bn) * dirMul;
}
return av.localeCompare(bv) * dirMul;
});
rows.forEach(function (r) { tbody.appendChild(r); });
}
function paintSortIndicators() {
var s = getSort();
document.querySelectorAll('th.sortable').forEach(function (th) {
th.classList.remove('sort-asc', 'sort-desc');
if (s && th.dataset.sortKey === s.col) {
th.classList.add(s.dir === 'asc' ? 'sort-asc' : 'sort-desc');
}
});
}
document.addEventListener('click', function (e) {
var th = e.target.closest('th.sortable');
if (!th) return;
var col = th.dataset.sortKey;
var s = getSort();
var dir = 'asc';
if (s && s.col === col) {
// Click cycle: asc → desc → cleared
if (s.dir === 'asc') dir = 'desc';
else { setSort(null); applySort(); paintSortIndicators(); return; }
}
setSort(col, dir);
applySort();
paintSortIndicators();
});
// Initial paint on page load (HTML is already rendered server-side).
applySort();
paintSortIndicators();
updateCounts(); updateCounts();
// ----------------------------------------------------------------------- // -----------------------------------------------------------------------
@ -1345,14 +1271,8 @@
} }
} }
// Stash the last drive object so the burn-in panel renderer can
// pull temperature_c into the vital-signs row without having to
// pass it through the Burn-In renderer's signature.
var _DRAWER_LAST_DRIVE = null;
function _drawerRender(data) { function _drawerRender(data) {
var drive = data.drive || {}; var drive = data.drive || {};
_DRAWER_LAST_DRIVE = drive;
var devnameEl = document.getElementById('drawer-devname'); var devnameEl = document.getElementById('drawer-devname');
var metaEl = document.getElementById('drawer-drive-meta'); var metaEl = document.getElementById('drawer-drive-meta');
if (devnameEl) devnameEl.textContent = drive.devname || '\u2014'; if (devnameEl) devnameEl.textContent = drive.devname || '\u2014';
@ -1366,170 +1286,6 @@
_drawerRenderEvents(data.events); _drawerRenderEvents(data.events);
} }
// Vital-signs row above the meters: drive temp, live throughput,
// elapsed time, ETA. Computed from data already in the drawer payload.
function _drawerRenderBadblocksVitals(stage, drive) {
var phase = parseInt(stage.bb_phase, 10) || 1;
var phasePct = parseFloat(stage.bb_phase_pct || 0);
var overallPct = ((phase - 1) * 100 + phasePct) / 8; // 0..100
var html = '<div class="bb-vitals">';
var dateOpts = {
weekday: 'short', month: 'short', day: 'numeric',
hour: 'numeric', minute: '2-digit',
};
// Start (wall-clock, with date)
if (stage.started_at) {
var startMs = Date.parse(stage.started_at);
var startStr = new Date(startMs).toLocaleString(undefined, dateOpts);
html += '<div class="bb-vital">';
html += '<span class="bb-vital-label">Start</span>';
html += '<span class="bb-vital-value">' + startStr + '</span>';
html += '</div>';
// Elapsed
var elapsedSec = Math.max(0, (Date.now() - startMs) / 1000);
html += '<div class="bb-vital">';
html += '<span class="bb-vital-label">Elapsed</span>';
html += '<span class="bb-vital-value">' + _bbFmtDuration(elapsedSec) + '</span>';
html += '</div>';
// ETA + Finish — only once we have measurable progress, so the
// first samples don't paint a "47 days" estimate.
if (overallPct >= 0.5) {
var totalSec = elapsedSec * (100 / overallPct);
var remainingSec = Math.max(0, totalSec - elapsedSec);
html += '<div class="bb-vital">';
html += '<span class="bb-vital-label">ETA</span>';
html += '<span class="bb-vital-value">' + _bbFmtDuration(remainingSec) + '</span>';
html += '</div>';
var finishStr = new Date(Date.now() + remainingSec * 1000)
.toLocaleString(undefined, dateOpts);
html += '<div class="bb-vital">';
html += '<span class="bb-vital-label">Finish</span>';
html += '<span class="bb-vital-value">' + finishStr + '</span>';
html += '</div>';
}
}
// Temp with hot/warm/cool colour
if (drive && typeof drive.temperature_c === 'number') {
var tc = drive.temperature_c;
var tClass = 'temp-cool';
if (tc >= 48) tClass = 'temp-hot';
else if (tc >= 42) tClass = 'temp-warm';
html += '<div class="bb-vital">';
html += '<span class="bb-vital-label">Temp</span>';
html += '<span class="bb-vital-value temp ' + tClass + '">' + tc + '°C</span>';
html += '</div>';
}
html += '</div>';
return html;
}
function _bbFmtDuration(sec) {
sec = Math.floor(sec);
var d = Math.floor(sec / 86400);
var h = Math.floor((sec % 86400) / 3600);
var m = Math.floor((sec % 3600) / 60);
if (d > 0) return d + 'd ' + h + 'h';
if (h > 0) return h + 'h ' + m + 'm';
return m + 'm';
}
// Phase caption — explicit text below the meters: e.g.
// "Pattern 2 of 4 · Verify 0x55 · 47% within phase".
function _drawerRenderBadblocksCaption(phase, phasePct) {
if (!phase) return '';
var p = parseInt(phase, 10);
var pct = parseFloat(phasePct || 0);
var labels = ['0xaa', '0x55', '0xff', '0x00'];
var pattern = Math.ceil(p / 2);
var subPhase = (p % 2 === 1) ? 'Write' : 'Verify';
var label = labels[pattern - 1];
var html = '<div class="bb-caption">';
html += 'Pattern ' + pattern + ' of 4 · ';
html += subPhase + ' ' + label + ' · ';
html += pct.toFixed(1) + '% within phase';
html += '</div>';
return html;
}
// Per-pattern duration history. Reads bb_phase_history (JSON) and
// emits "0xaa: 14h 22m" rows for completed patterns. Pattern N is
// "complete" when its verify-phase end timestamp is known (= the
// next pattern's write-phase start, or stage.finished_at for the
// final one).
function _drawerRenderBadblocksHistory(stage) {
if (!stage.bb_phase_history) return '';
var hist;
try { hist = JSON.parse(stage.bb_phase_history); }
catch (e) { return ''; }
if (!hist || typeof hist !== 'object') return '';
var labels = ['0xaa', '0x55', '0xff', '0x00'];
var rows = [];
for (var n = 1; n <= 4; n++) {
var writeStart = hist[String(2 * n - 1)];
if (!writeStart) continue;
var endTs = (n < 4) ? hist[String(2 * n + 1)] : stage.finished_at;
if (!endTs) continue;
var elapsedSec = (Date.parse(endTs) - Date.parse(writeStart)) / 1000;
if (elapsedSec <= 0) continue;
rows.push('<span class="bb-hist-row">' +
'<span class="bb-hist-label">' + labels[n - 1] + '</span>' +
'<span class="bb-hist-dur">' + _bbFmtDuration(elapsedSec) + '</span>' +
'</span>');
}
if (!rows.length) return '';
return '<div class="bb-history"><span class="bb-hist-title">Completed patterns</span>' +
rows.join('') + '</div>';
}
// Render 4 pattern meters for badblocks -w surface_validate. Each
// meter splits write/verify halves so you can see at a glance which
// pattern is current AND whether you're writing or verifying within
// it. phase: 1-8 (1=write 0xaa, 2=verify 0xaa, 3=write 0x55, ...).
function _drawerRenderBadblocksMeters(phase, phasePct) {
  if (!phase) return '';
  var p = parseInt(phase, 10);
  var pct = parseFloat(phasePct || 0);
  var labels = ['0xaa', '0x55', '0xff', '0x00'];
  var html = '<div class="bb-meters">';
  for (var i = 0; i < 4; i++) {
    var writePhase = i * 2 + 1;
    var verifyPhase = writePhase + 1;
    var writeFill, verifyFill;
    if (p > verifyPhase) {
      writeFill = 100; verifyFill = 100;
    } else if (p === verifyPhase) {
      writeFill = 100; verifyFill = pct;
    } else if (p === writePhase) {
      writeFill = pct; verifyFill = 0;
    } else {
      writeFill = 0; verifyFill = 0;
    }
    var classes = 'bb-meter';
    if (p === writePhase || p === verifyPhase) classes += ' bb-meter-current';
    if (p > verifyPhase) classes += ' bb-meter-done';
    html += '<div class="' + classes + '">';
    html += '<div class="bb-meter-label">' + labels[i] + '</div>';
    html += '<div class="bb-meter-bar">';
    html += '<div class="bb-meter-half bb-write" style="width:' + writeFill.toFixed(1) + '%"></div>';
    html += '<div class="bb-meter-half-spacer"></div>';
    html += '<div class="bb-meter-half bb-verify" style="width:' + verifyFill.toFixed(1) + '%"></div>';
    html += '</div>';
    html += '<div class="bb-meter-sub">';
    html += '<span class="bb-sub-write">W ' + Math.round(writeFill) + '%</span>';
    html += '<span class="bb-sub-verify">V ' + Math.round(verifyFill) + '%</span>';
    html += '</div>';
    html += '</div>';
  }
  html += '</div>';
  return html;
}
function _drawerRenderBurnin(burnin) { function _drawerRenderBurnin(burnin) {
var panel = document.getElementById('drawer-panel-burnin'); var panel = document.getElementById('drawer-panel-burnin');
if (!panel) return; if (!panel) return;
@ -1544,30 +1300,7 @@
html += '<span class="drawer-job-meta">'; html += '<span class="drawer-job-meta">';
if (burnin.operator) html += 'by ' + _esc(burnin.operator); if (burnin.operator) html += 'by ' + _esc(burnin.operator);
if (burnin.started_at) html += ' \u00b7 ' + _drawerFmtDt(burnin.started_at); if (burnin.started_at) html += ' \u00b7 ' + _drawerFmtDt(burnin.started_at);
html += '</span>'; html += '</span></div>';
// Job-level estimated completion. Uses the weighted overall job %
// (recalculated server-side from stage progress) so it reflects
// every stage, not just the current one. Suppressed under 0.5%
// so the early sample doesn't paint a "Finish: Sep 22" stutter.
if (burnin.state === 'running' && burnin.started_at) {
var jobPct = parseFloat(burnin.percent || 0);
if (jobPct >= 0.5) {
var jobStartMs = Date.parse(burnin.started_at);
var jobElapsedSec = Math.max(0, (Date.now() - jobStartMs) / 1000);
var jobTotalSec = jobElapsedSec * (100 / jobPct);
var jobRemainSec = Math.max(0, jobTotalSec - jobElapsedSec);
var jobFinish = new Date(Date.now() + jobRemainSec * 1000);
var jobFinishStr = jobFinish.toLocaleString(undefined, {
weekday: 'short', month: 'short', day: 'numeric',
hour: 'numeric', minute: '2-digit',
});
html += '<span class="drawer-job-finish" title="Estimated completion of the entire burn-in (all stages)">';
html += '<span class="drawer-job-finish-label">Est. completion</span>';
html += '<span class="drawer-job-finish-value">' + jobFinishStr + '</span>';
html += '</span>';
}
}
html += '</div>';
html += '<div class="drawer-stages">'; html += '<div class="drawer-stages">';
var stages = burnin.stages || []; var stages = burnin.stages || [];
@ -1587,37 +1320,9 @@
html += '<span class="stage-duration">' + _drawerFmtDuration(s.started_at, s.finished_at) + '</span>'; html += '<span class="stage-duration">' + _drawerFmtDuration(s.started_at, s.finished_at) + '</span>';
} }
html += '</div>'; html += '</div>';
// Prominent "Why it failed" block at the top of failed/cancelled/ if (s.error_text) {
// unknown stages. Falls back to a heuristic when no error was
// recorded — e.g. a tiny log + no badblocks progress + terminal
// state means the stage was killed externally (SSH disconnect or
// container restart) before it could record an error.
if (s.state === 'failed' || s.state === 'cancelled' || s.state === 'unknown') {
var reason = s.error_text;
if (!reason) {
var logLen = (s.log_text || '').length;
var noBbProgress = !s.bb_phase || (s.bb_phase === 1 && (parseFloat(s.bb_phase_pct || 0) < 0.1));
if (logLen < 500 && noBbProgress) {
reason = 'Stopped without recording an error — likely cause: SSH connection drop or container restart while this stage was running.';
} else {
reason = 'No error message recorded.';
}
}
html += '<div class="stage-reason stage-reason-' + _esc(s.state) + '">';
html += '<span class="stage-reason-label">Reason</span>';
html += '<span class="stage-reason-text">' + _esc(reason) + '</span>';
html += '</div>';
} else if (s.error_text) {
html += '<div class="stage-error-line">' + _esc(s.error_text) + '</div>'; html += '<div class="stage-error-line">' + _esc(s.error_text) + '</div>';
} }
// Per-pattern meters for badblocks surface_validate, plus the
// vital-signs row above (temp / speed / elapsed / ETA).
if (s.stage_name === 'surface_validate' && s.bb_phase) {
html += _drawerRenderBadblocksVitals(s, _DRAWER_LAST_DRIVE);
html += _drawerRenderBadblocksMeters(s.bb_phase, s.bb_phase_pct);
html += _drawerRenderBadblocksCaption(s.bb_phase, s.bb_phase_pct);
html += _drawerRenderBadblocksHistory(s);
}
// Raw SSH log output (if available) // Raw SSH log output (if available)
if (s.log_text) { if (s.log_text) {
var logHtml = _esc(s.log_text) var logHtml = _esc(s.log_text)


@ -46,13 +46,7 @@
{%- elif bi.state == 'passed' -%} {%- elif bi.state == 'passed' -%}
<span class="chip chip-passed">Passed</span> <span class="chip chip-passed">Passed</span>
{%- elif bi.state == 'failed' -%} {%- elif bi.state == 'failed' -%}
{# Suppress the stage suffix for SMART + surface_validate stages. <span class="chip chip-failed">Failed{% if bi.stage_name %} ({{ bi.stage_name | replace('_',' ') }}){% endif %}</span>
SMART has its own columns, and surface_validate is the dominant
case so a redundant suffix just adds visual noise. The drawer
shows the per-stage Reason for any digging. Keep the suffix for
precheck / final_check since those are rare enough that the hint
is helpful. #}
<span class="chip chip-failed">Failed{% if bi.stage_name and bi.stage_name not in ('short_smart', 'long_smart', 'surface_validate') %} ({{ bi.stage_name | replace('_',' ') }}){% endif %}</span>
{%- elif bi.state == 'cancelled' -%} {%- elif bi.state == 'cancelled' -%}
<span class="chip chip-aborted">Cancelled</span> <span class="chip chip-aborted">Cancelled</span>
{%- elif bi.state == 'unknown' -%} {%- elif bi.state == 'unknown' -%}
@ -69,14 +63,14 @@
<th class="col-check"> <th class="col-check">
<input type="checkbox" id="select-all-cb" class="drive-cb" title="Select all idle drives"> <input type="checkbox" id="select-all-cb" class="drive-cb" title="Select all idle drives">
</th> </th>
<th class="col-drive sortable" data-sort-key="drive">Drive</th> <th class="col-drive">Drive</th>
<th class="col-serial sortable" data-sort-key="serial">Serial</th> <th class="col-serial">Serial</th>
<th class="col-size sortable" data-sort-key="size">Size</th> <th class="col-size">Size</th>
<th class="col-temp sortable" data-sort-key="temp">Temp</th> <th class="col-temp">Temp</th>
<th class="col-health sortable" data-sort-key="health">Health</th> <th class="col-health">Health</th>
<th class="col-smart sortable" data-sort-key="short">Short SMART</th> <th class="col-smart">Short SMART</th>
<th class="col-smart sortable" data-sort-key="long">Long SMART</th> <th class="col-smart">Long SMART</th>
<th class="col-burnin sortable" data-sort-key="burnin">Burn-In</th> <th class="col-burnin">Burn-In</th>
<th class="col-actions">Actions</th> <th class="col-actions">Actions</th>
</tr> </tr>
</thead> </thead>
@ -95,19 +89,7 @@
{%- set smart_done = (drive.smart_short and drive.smart_short.state in ('passed','failed','aborted')) {%- set smart_done = (drive.smart_short and drive.smart_short.state in ('passed','failed','aborted'))
or (drive.smart_long and drive.smart_long.state in ('passed','failed','aborted')) %} or (drive.smart_long and drive.smart_long.state in ('passed','failed','aborted')) %}
{%- set can_reset = (bi_done or smart_done) and not bi_active and not short_busy and not long_busy and not pool_locked %} {%- set can_reset = (bi_done or smart_done) and not bi_active and not short_busy and not long_busy and not pool_locked %}
{%- set short_state = drive.smart_short.state if drive.smart_short else 'idle' %} <tr data-status="{{ drive.status }}" id="drive-{{ drive.id }}">
{%- set long_state = drive.smart_long.state if drive.smart_long else 'idle' %}
{%- set burnin_state = drive.burnin.state if drive.burnin else '' %}
<tr data-status="{{ drive.status }}" id="drive-{{ drive.id }}"
data-sort-drive="{{ drive.devname }}"
data-sort-serial="{{ (drive.serial or '') | lower }}"
data-sort-size="{{ drive.size_bytes or 0 }}"
data-sort-temp="{{ drive.temperature_c if drive.temperature_c is not none else '' }}"
data-sort-health="{{ {'PASSED': 1, 'WARNING': 2, 'FAILED': 3, 'UNKNOWN': 4}.get(drive.smart_health, 9) }}"
data-sort-short="{{ {'running': 1, 'failed': 2, 'aborted': 3, 'passed': 4, 'idle': 5}.get(short_state, 9) }}"
data-sort-long="{{ {'running': 1, 'failed': 2, 'aborted': 3, 'passed': 4, 'idle': 5}.get(long_state, 9) }}"
data-sort-burnin="{{ {'running': 1, 'queued': 2, 'failed': 3, 'unknown': 4, 'cancelled': 5, 'passed': 6}.get(burnin_state, 9) }}"
>
<td class="col-check"> <td class="col-check">
{%- if selectable %} {%- if selectable %}
<input type="checkbox" class="drive-checkbox" data-drive-id="{{ drive.id }}"> <input type="checkbox" class="drive-checkbox" data-drive-id="{{ drive.id }}">


@ -1,125 +0,0 @@
"""Verifies _BadblocksProgress translates per-phase badblocks output
into a monotonic 0-99% overall progress.

`badblocks -w` cycles through 4 patterns × {write, verify} = 8 phases.
Each phase prints "XX% done" relative to its own 0-100 range. Without
this translation the dashboard appeared to "rewind" every ~2 hours
when a new phase started and two drives racing each other could
look 4× apart in displayed progress despite identical hardware.

Run inside the container image so app deps are present.
"""
from __future__ import annotations

import unittest

from app.burnin.stages import _BadblocksProgress


class TestBadblocksProgress(unittest.TestCase):
    def test_default_phase_one(self):
        """Before any header, treat as start of pattern-1 write."""
        p = _BadblocksProgress()
        self.assertEqual(p.phase, 1)
        self.assertEqual(p.overall_pct, 0)

    def test_pattern_headers_set_phase(self):
        """0xaa→1, 0x55→3, 0xff→5, 0x00→7 (write phases)."""
        p = _BadblocksProgress()
        for header, want in [
            ("Testing with pattern 0xaa: ", 1),
            ("Testing with pattern 0x55: ", 3),
            ("Testing with pattern 0xff: ", 5),
            ("Testing with pattern 0x00: ", 7),
        ]:
            p.update(header)
            self.assertEqual(p.phase, want, f"after {header!r}")

    def test_verify_advances_to_next_phase(self):
        """`Reading and comparing` after `Testing with pattern 0x55`
        (phase 3) advances to phase 4."""
        p = _BadblocksProgress()
        p.update("Testing with pattern 0x55: 100.00% done")
        self.assertEqual(p.phase, 3)
        p.update("Reading and comparing: 0.00% done")
        self.assertEqual(p.phase, 4)

    def test_overall_pct_at_phase_boundaries(self):
        """Verify the math at each phase boundary: phase N at 100% =
        N * 12.5% overall (clipped to 99 at the end)."""
        cases = [
            (1, 0.0, 0),     # start of run
            (1, 100.0, 12),  # 100/800 = 12.5
            (2, 100.0, 25),  # 200/800
            (4, 100.0, 50),  # 400/800
            (7, 100.0, 87),  # 700/800
            (8, 100.0, 99),  # 800/800 → clipped to 99
        ]
        for phase, phase_pct, want in cases:
            p = _BadblocksProgress()
            p.phase = phase
            p.phase_pct = phase_pct
            self.assertEqual(
                p.overall_pct, want,
                f"phase={phase} phase_pct={phase_pct}",
            )

    def test_realistic_sequence(self):
        """End-to-end: feed a synthetic badblocks output stream and
        check the overall percent stays monotonically non-decreasing."""
        lines = [
            "Testing with pattern 0xaa: ",
            "10.00% done, 1:00:00 elapsed. (0/0/0 errors)",
            "50.00% done, 5:00:00 elapsed. (0/0/0 errors)",
            "99.99% done, 10:00:00 elapsed. (0/0/0 errors)",
            "Reading and comparing: ",
            "0.00% done, 10:00:01 elapsed. (0/0/0 errors)",
            "50.00% done, 12:30:00 elapsed. (0/0/0 errors)",
            "Testing with pattern 0x55: ",
            "0.00% done, 15:00:00 elapsed. (0/0/0 errors)",
            "50.00% done, 17:30:00 elapsed. (0/0/0 errors)",
        ]
        p = _BadblocksProgress()
        seen = []
        for line in lines:
            p.update(line)
            seen.append(p.overall_pct)
        self.assertEqual(
            seen, sorted(seen),
            f"progress went backwards: {seen}",
        )
        # Sanity: by the time we're halfway through pattern-2 write
        # (phase 3, 50%), we should report ((3-1)*100 + 50) / 8 = 31%.
        self.assertEqual(seen[-1], 31)

    def test_drives_at_different_phases_show_different_overall(self):
        """The original bug: two drives at the same per-phase 60%
        but different phases used to look identical (both '60%').
        Now they correctly diverge."""
        slow = _BadblocksProgress()
        slow.update("Testing with pattern 0xaa: ")
        slow.update("60.00% done")
        fast = _BadblocksProgress()
        fast.update("Testing with pattern 0xaa: ")
        fast.update("99.99% done")
        fast.update("Reading and comparing: ")
        fast.update("60.00% done")
        # slow: 60/800 = 7%; fast: (1*100 + 60)/800 = 20%
        self.assertEqual(slow.overall_pct, 7)
        self.assertEqual(fast.overall_pct, 20)

    def test_unknown_pattern_does_not_crash(self):
        """An unrecognized pattern (e.g. badblocks future versions or
        custom patterns) just leaves phase unchanged."""
        p = _BadblocksProgress()
        p.update("Testing with pattern 0xab: ")
        # phase stays at the default 1
        self.assertEqual(p.phase, 1)


if __name__ == "__main__":
    unittest.main()
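
The tests above pin down the phase arithmetic without showing it. The following is a minimal sketch of a parser that would satisfy them, not the actual `_BadblocksProgress` from `app.burnin.stages`. The class name `ProgressSketch`, the two regular expressions, and the choice to floor the result are assumptions inferred from the expected values in the tests.

```python
import re

# Hypothetical stand-in for _BadblocksProgress; the real class may differ.
# Write phases per pattern, as asserted by test_pattern_headers_set_phase.
_WRITE_PHASE = {"0xaa": 1, "0x55": 3, "0xff": 5, "0x00": 7}
_PATTERN_RE = re.compile(r"Testing with pattern (0x[0-9a-fA-F]+)")
_PCT_RE = re.compile(r"(\d+(?:\.\d+)?)% done")


class ProgressSketch:
    def __init__(self):
        # Before any header, assume the start of the pattern-1 write.
        self.phase = 1
        self.phase_pct = 0.0

    def update(self, line):
        header = _PATTERN_RE.search(line)
        if header:
            new_phase = _WRITE_PHASE.get(header.group(1).lower())
            if new_phase is not None:  # unknown patterns leave phase alone
                self.phase = new_phase
                self.phase_pct = 0.0
        elif "Reading and comparing" in line and self.phase % 2 == 1:
            # Move from a write phase (odd) to its verify phase (even).
            self.phase += 1
            self.phase_pct = 0.0
        pct = _PCT_RE.search(line)
        if pct:
            self.phase_pct = float(pct.group(1))

    @property
    def overall_pct(self):
        # Eight equal phases; floor, and cap at 99 so 100 is only ever
        # reported by whatever marks the stage as finished.
        raw = ((self.phase - 1) * 100 + self.phase_pct) / 8
        return min(99, int(raw))
```

With this sketch, a drive halfway through the second pattern's write (phase 3 at 50%) reports `int(250 / 8)`, which is 31, matching the sanity check in `test_realistic_sequence`.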


@ -1,100 +0,0 @@
"""Verifies _update_stage_bb_phase actually writes to burnin_stages
and the migration adds the columns idempotently.

The drive-drawer's 4-meter UI depends on these columns being populated
on every parser tick. If a future refactor drops the call or breaks
the migration, this test catches it before users see the meters
go blank.

Run inside the container image so app deps are present.
"""
from __future__ import annotations

import os
import tempfile
import unittest

import aiosqlite


async def _setup_db_with_stage() -> str:
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    from app.config import settings
    settings.db_path = path
    from app.database import init_db
    await init_db()
    async with aiosqlite.connect(path) as db:
        await db.execute(
            "INSERT INTO drives "
            "(truenas_disk_id, devname, serial, model, size_bytes, "
            " temperature_c, smart_health, last_seen_at, last_polled_at) "
            "VALUES ('id-1', 'sda', 'SER1', 'TestModel', 14000000000000, "
            " 30, 'PASSED', '2026-05-09T00:00:00+00:00', "
            " '2026-05-09T00:00:00+00:00')"
        )
        await db.execute(
            "INSERT INTO burnin_jobs "
            "(drive_id, profile, state, operator, created_at) "
            "VALUES (1, 'surface', 'running', 'op', "
            " '2026-05-09T00:00:00+00:00')"
        )
        await db.execute(
            "INSERT INTO burnin_stages "
            "(burnin_job_id, stage_name, state) "
            "VALUES (1, 'surface_validate', 'running')"
        )
        await db.commit()
    return path


class TestBBPhasePersistence(unittest.IsolatedAsyncioTestCase):
    async def asyncSetUp(self):
        self.path = await _setup_db_with_stage()

    async def asyncTearDown(self):
        try:
            os.unlink(self.path)
        except OSError:
            pass

    async def test_columns_exist_after_init(self):
        async with aiosqlite.connect(self.path) as db:
            cur = await db.execute("PRAGMA table_info(burnin_stages)")
            cols = {r[1] for r in await cur.fetchall()}
        self.assertIn("bb_phase", cols)
        self.assertIn("bb_phase_pct", cols)

    async def test_update_writes_phase_and_pct(self):
        from app.burnin._common import _update_stage_bb_phase
        await _update_stage_bb_phase(1, "surface_validate", 3, 47.5)
        async with aiosqlite.connect(self.path) as db:
            cur = await db.execute(
                "SELECT bb_phase, bb_phase_pct FROM burnin_stages "
                "WHERE burnin_job_id=1 AND stage_name='surface_validate'"
            )
            row = await cur.fetchone()
        self.assertEqual(row[0], 3)
        self.assertAlmostEqual(row[1], 47.5)

    async def test_update_overwrites(self):
        """Each tick should replace the previous value, not accumulate."""
        from app.burnin._common import _update_stage_bb_phase
        await _update_stage_bb_phase(1, "surface_validate", 1, 10.0)
        await _update_stage_bb_phase(1, "surface_validate", 2, 80.0)
        async with aiosqlite.connect(self.path) as db:
            cur = await db.execute(
                "SELECT bb_phase, bb_phase_pct FROM burnin_stages "
                "WHERE burnin_job_id=1 AND stage_name='surface_validate'"
            )
            row = await cur.fetchone()
        self.assertEqual(row[0], 2)
        self.assertAlmostEqual(row[1], 80.0)


if __name__ == "__main__":
    unittest.main()
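
The docstring above says the migration adds the `bb_phase` and `bb_phase_pct` columns idempotently, but the migration itself is not part of this diff. One common way to get that property in SQLite is sketched below using the standard-library `sqlite3` module rather than `aiosqlite`. The helper names `add_column_if_missing` and `migrate` and the `INTEGER` / `REAL` column types are assumptions for illustration; the project's `init_db` may do this differently.

```python
import sqlite3


def add_column_if_missing(conn, table, column, decl):
    """Add `column` to `table` unless it already exists.

    Hypothetical helper: table and column names are interpolated
    directly, so they must come from trusted code, never user input.
    """
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column in existing:
        return False
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")
    return True


def migrate(conn):
    # Column types are an assumption: phase is 1-8, pct is 0.0-100.0.
    add_column_if_missing(conn, "burnin_stages", "bb_phase", "INTEGER")
    add_column_if_missing(conn, "burnin_stages", "bb_phase_pct", "REAL")
    conn.commit()
```

Calling `migrate` on every startup is then safe: the second and later calls find the columns through `PRAGMA table_info` and skip the `ALTER TABLE`, which would otherwise fail with a duplicate-column error.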