TrueNAS Burn-In Dashboard v0.9.0 — Live mode, thermal monitoring, adaptive concurrency

Go live against real TrueNAS SCALE 25.10:
- Remove mock-truenas dependency; mount SSH key as Docker secret
- Filter expired disk records from /api/v2.0/disk (expiretime field)
- Route all SMART operations through SSH (SCALE 25.10 removed REST smart/test endpoint)
- Poll drive temperatures via POST /api/v2.0/disk/temperatures (SCALE-specific)
- Store raw smartctl output in smart_tests.raw_output for proof of test execution
- Fix percent-remaining=0 false jump to 100% on test start
- Fix terminal WebSocket: add mounted key file fallback (/run/secrets/ssh_key)
- Fix WebSocket support: uvicorn → uvicorn[standard] (installs websockets)

HBA/system sensor temps on dashboard:
- SSH to TrueNAS and run sensors -j each poll cycle
- Parse coretemp (CPU package) and pch_* (PCH/chipset — storage I/O proxy)
- Render as compact chips in stats bar, color-coded green/yellow/red
- Live updates via new SSE system-sensors event every 12s

Adaptive concurrency signal:
- Thermal pressure indicator in stats bar: hidden when OK; shows WARM/HOT when
  drives under active burn-in hit the temp_warn_c / temp_crit_c thresholds
- Thermal gate in burn-in queue: a job waits up to 3 min before acquiring a
  semaphore slot if running drives are already at warning temp; the gate times
  out and the job proceeds anyway so the queue never stalls indefinitely

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
echoparkbaby — 2026-02-27 06:33:36 -05:00
parent b1a0fe6bd5 · commit 3e0000528f
23 changed files with 3211 additions and 169 deletions


@ -1,7 +1,7 @@
# TrueNAS Burn-In Dashboard — Project Context
> Drop this file in any new Claude session to resume work with full context.
> Last updated: 2026-02-22 (Stage 6d)
> Last updated: 2026-02-24 (Stage 8)
---
@ -28,7 +28,8 @@ against a TrueNAS CORE instance. Deployed on **maple.local** (10.0.0.138).
| 6b | UX overhaul (stats bar, alerts, batch, notifications, location, print, analytics) | ✅ |
| 6c | Settings overhaul (editable form, runtime store, SMTP fix, stage selection) | ✅ |
| 6d | Cancel SMART tests, Cancel All burn-ins, drag-to-reorder stages in modals | ✅ |
| 7 | Cut to real TrueNAS | 🔲 future |
| 7 | SSH burn-in execution, SMART attr monitoring, drive reset, version badge, stats polish | ✅ |
| 8 | Live SSH terminal in drawer (xterm.js + asyncssh WebSocket PTY bridge) | ✅ |
---
@ -52,6 +53,8 @@ truenas-burnin/
├── database.py # schema, migrations, init_db(), get_db()
├── models.py # Pydantic v2 models; StartBurninRequest has run_surface/run_short/run_long + profile property
├── settings_store.py # runtime settings store — persists to /data/settings_overrides.json
├── ssh_client.py # asyncssh client: smartctl parsing, badblocks streaming, test_connection
├── terminal.py # WebSocket ↔ asyncssh PTY bridge for live terminal tab
├── truenas.py # httpx async client with retry (lambda factory pattern)
├── poller.py # poll loop, SSE pub/sub, stale detection, stuck-job check
├── burnin.py # orchestrator, semaphore, stages, check_stuck_jobs()
@ -68,12 +71,12 @@ truenas-burnin/
└── templates/
├── layout.html # header nav: History, Stats, Audit, Settings, bell button
├── dashboard.html # stats bar, failed banner, batch bar
├── dashboard.html # stats bar, failed banner, batch bar, log drawer (4 tabs: Burn-In/SMART/Events/Terminal)
├── history.html
├── job_detail.html # + Print/Export button
├── audit.html # audit event log
├── stats.html # analytics: pass rate by model, daily activity
├── settings.html # editable 2-col form: SMTP (left) + Notifications/Behavior/Webhook (right)
├── stats.html # analytics: pass rate by model, daily activity, duration by size, failures by stage
├── settings.html # editable 2-col form: SMTP + SSH (left) + Notifications/Behavior/Webhook/System (right)
├── job_print.html # print view with client-side QR code (qrcodejs CDN)
└── components/
├── drives_table.html # checkboxes, elapsed time, location inline edit
@ -129,10 +132,19 @@ burnin_jobs (id, drive_id FK, profile, state CHECK(queued/running/passed/
-- burnin_stages: one row per stage per job
burnin_stages (id, burnin_job_id FK, stage_name, state, percent,
started_at, finished_at, error_text)
started_at, finished_at, error_text,
log_text TEXT, -- raw smartctl/badblocks SSH output
bad_blocks INTEGER) -- bad sector count from surface_validate
-- audit_events: append-only log
audit_events (id, event_type, drive_id, job_id, operator, note, created_at)
-- drives columns added by migrations:
-- location TEXT, notes TEXT (Stage 6b)
-- smart_attrs TEXT -- JSON blob of last SMART attribute snapshot (Stage 7)
-- smart_tests columns added by migrations:
-- raw_output TEXT -- raw smartctl -a output (Stage 7)
```
---
@ -194,6 +206,15 @@ All read from `.env` via `pydantic-settings`. See `.env.example` for full list.
| `SMTP_ALERT_ON_FAIL` | `true` | Immediate email when a job fails |
| `SMTP_ALERT_ON_PASS` | `false` | Immediate email when a job passes |
| `WEBHOOK_URL` | `` | POST JSON on burnin_passed/burnin_failed. Works with ntfy, Slack, Discord, n8n |
| `TEMP_WARN_C` | `46` | Temperature warning threshold (°C) |
| `TEMP_CRIT_C` | `55` | Temperature critical threshold — precheck fails above this |
| `BAD_BLOCK_THRESHOLD` | `0` | Max bad blocks allowed before surface_validate fails (0 = any bad = fail) |
| `APP_VERSION` | `1.0.0-7` | Displayed in header version badge |
| `SSH_HOST` | `` | TrueNAS SSH hostname/IP — empty disables SSH mode (uses mock/REST) |
| `SSH_PORT` | `22` | TrueNAS SSH port |
| `SSH_USER` | `root` | TrueNAS SSH username |
| `SSH_PASSWORD` | `` | TrueNAS SSH password (use key instead for production) |
| `SSH_KEY` | `` | TrueNAS SSH private key PEM string — loaded in-memory, never written to disk |
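For reference, the Stage 7 additions above might look like this in `.env` (values illustrative; the IP matches the example used later in this doc):

```env
# Thermal / burn-in thresholds
TEMP_WARN_C=46
TEMP_CRIT_C=55
BAD_BLOCK_THRESHOLD=0

# SSH mode — leave SSH_HOST empty to stay in mock/REST mode
SSH_HOST=10.0.0.203
SSH_PORT=22
SSH_USER=root
SSH_KEY="-----BEGIN OPENSSH PRIVATE KEY-----\n...\n-----END OPENSSH PRIVATE KEY-----"
```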
---
@ -305,27 +326,166 @@ async def burnin_get(job_id: int, ...): ...
| First row clipped after Stage 6b | Stats bar added 70px but max-height not updated | `max-height: calc(100vh - 205px)` |
| SMTP "Connection unexpectedly closed" | `_send_email` used `settings.smtp_port` (587 default) even in SSL mode | Derive port from mode via `_MODE_PORTS` dict; SSL→465, STARTTLS→587, Plain→25 |
| SSL mode missing EHLO | `smtplib.SMTP_SSL` was created without calling `ehlo()` | Added `server.ehlo()` after both SSL and STARTTLS connections |
| `profile` NameError in `_execute_stages` | `_execute_stages` called `_recalculate_progress(job_id, profile)` but `profile` not in scope | Changed to `_recalculate_progress(job_id)` — profile param was unused |
| `app_version` Jinja2 global rendered as function | Set `templates.env.globals["app_version"] = _get_app_version` (callable) | Set to the static string value directly: `= _settings.app_version` |
| All buttons broken (Short/Long/Burn-In/Cancel) | `stages.forEach(function(s){` in `_drawerRenderBurnin` missing closing `});` — JS syntax error prevented entire IIFE from loading | Added missing `});` before `} else {` |
---
## Stage 7 — Cutting to Real TrueNAS (TODO)
## Feature Reference (Stage 7)
### SSH Burn-In Architecture
`ssh_client.py` provides an optional SSH execution layer. When `SSH_HOST` is set (and key or password is present), all burn-in stages run real commands over SSH against TrueNAS. When `SSH_HOST` is empty, stages fall back to mock/REST simulation.
**Dual-mode dispatch** — each stage checks `ssh_client.is_configured()`:
```python
if ssh_client.is_configured():
# run smartctl / badblocks over SSH
else:
# simulate with REST API or timed sleep (mock mode)
```
**SSH client capabilities** (`ssh_client.py`):
- `test_connection()` → `{"ok": bool, "error": str}` — used by Test SSH button
- `get_smart_attributes(devname)` → parse `smartctl -a`, return `{health, raw_output, attributes, warnings, failures}`
- `start_smart_test(devname, test_type)` → `smartctl -t short|long /dev/{devname}`
- `poll_smart_progress(devname)` → `smartctl -a` during test; returns `{state, percent_remaining, output}`
- `abort_smart_test(devname)` → `smartctl -X /dev/{devname}`
- `run_badblocks(devname, on_progress, cancelled_fn)` → streams `badblocks -wsv -b 4096 -p 1`; counts bad sectors from stdout (digit-only lines)
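The `poll_smart_progress` contract can be sketched as a pure parser over `smartctl -a` text (a minimal sketch — the function name and the exact phrases matched are assumptions based on common ATA smartctl wording, not the real `ssh_client.py` parser):

```python
import re

def parse_smart_progress(smartctl_output: str) -> dict:
    """Classify smartctl -a output into {state, percent_remaining}.

    Matches the common ATA self-test status phrases; "unknown" means
    keep polling (same convention as the SSH stage loop).
    """
    m = re.search(
        r"Self-test routine in progress.*?(\d+)%\s+of test remaining",
        smartctl_output, re.DOTALL,
    )
    if m:
        return {"state": "running", "percent_remaining": int(m.group(1))}
    if "completed without error" in smartctl_output:
        return {"state": "passed", "percent_remaining": 0}
    if re.search(r"previous self-test.*(failed|failure)",
                 smartctl_output, re.IGNORECASE):
        return {"state": "failed", "percent_remaining": 0}
    return {"state": "unknown", "percent_remaining": 0}
```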
**Key auth pattern** — key is stored as PEM string in settings, never written to disk:
```python
asyncssh.connect(host, ..., client_keys=[asyncssh.import_private_key(pem_str)], known_hosts=None)
```
**badblocks streaming** — uses `asyncssh.create_process()` with parallel stdout/stderr draining via `asyncio.gather`. Progress updates written to DB every 20 lines to avoid excessive writes.
### SMART Attribute Monitoring
Monitored attributes and their thresholds:
| ID | Name | Any non-zero → |
|----|------|----------------|
| 5 | Reallocated_Sector_Ct | FAIL |
| 10 | Spin_Retry_Count | WARN |
| 188 | Command_Timeout | WARN |
| 197 | Current_Pending_Sector | FAIL |
| 198 | Offline_Uncorrectable | FAIL |
| 199 | UDMA_CRC_Error_Count | WARN |
SMART attrs stored as JSON blob in `drives.smart_attrs`. Updated by `final_check` stage (SSH mode) or `short_smart`/`long_smart` REST mode. Displayed in drive drawer with colour-coded table + raw `smartctl -a` output.
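The table's FAIL/WARN policy reduces to a small classifier (a sketch; the function and the `{attr_id: raw_value}` snapshot shape are assumptions for illustration, not the actual `ssh_client.py` code):

```python
# Monitored SMART attributes from the table above: any non-zero raw value
# is a FAIL or WARN depending on the attribute ID.
FAIL_IDS = {5: "Reallocated_Sector_Ct",
            197: "Current_Pending_Sector",
            198: "Offline_Uncorrectable"}
WARN_IDS = {10: "Spin_Retry_Count",
            188: "Command_Timeout",
            199: "UDMA_CRC_Error_Count"}

def classify_attributes(attrs: dict[int, int]) -> tuple[list[str], list[str]]:
    """Return (failures, warnings) for a {attr_id: raw_value} snapshot."""
    failures = [f"{FAIL_IDS[i]} (id {i}) = {v}"
                for i, v in attrs.items() if i in FAIL_IDS and v > 0]
    warnings = [f"{WARN_IDS[i]} (id {i}) = {v}"
                for i, v in attrs.items() if i in WARN_IDS and v > 0]
    return failures, warnings
```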
### Drive Reset Action
- `POST /api/v1/drives/{drive_id}/reset` — clears `smart_tests` rows to idle, clears `drives.smart_attrs`, writes audit event, notifies SSE subscribers
- Button appears in the action column when `can_reset` is true: the drive has no active burn-in AND has a non-idle SMART test state or stored SMART attrs
- Burn-in history (burnin_jobs, burnin_stages) is preserved — reset only affects SMART test state
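The `can_reset` condition can be expressed as a small predicate (a sketch — the flattened drive-row dict and field names here are assumptions for illustration, not the actual query code):

```python
def can_reset(drive: dict) -> bool:
    """A drive is resettable when it has no active burn-in AND there is
    SMART state worth clearing (non-idle test state or stored attrs)."""
    no_active_burnin = drive.get("burnin_state") not in ("queued", "running")
    has_smart_state = (drive.get("smart_state", "idle") != "idle"
                       or bool(drive.get("smart_attrs")))
    return no_active_burnin and has_smart_state
```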
### New Routes (Stage 7)
| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/api/v1/drives/{id}/reset` | Reset SMART state and attrs for a drive |
| `POST` | `/api/v1/settings/test-ssh` | Test SSH connection with current SSH settings |
| `GET` | `/api/v1/updates/check` | Check for latest release from Forgejo git.hellocomputer.xyz |
### Check for Updates
Settings page has a "Check for Updates" button that fetches:
```
GET https://git.hellocomputer.xyz/api/v1/repos/brandon/truenas-burnin/releases/latest
```
Compares tag name against `settings.app_version`; shows "up to date" or "v{tag} available".
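The comparison step can be sketched as a pure function (a sketch under the stated behaviour; the function name is hypothetical and the real handler may normalise tags differently):

```python
def compare_release(current: str, latest_tag: str) -> str:
    """Compare the app version against a release tag like 'v1.0.0-8'.

    Plain string equality after dropping the leading 'v' — mirrors the
    'up to date' / 'v{tag} available' messages described above.
    """
    tag = latest_tag[1:] if latest_tag.startswith("v") else latest_tag
    if tag == current:
        return "up to date"
    return f"v{tag} available"
```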
### Version Badge
`app_version` set as Jinja2 global in `renderer.py`:
```python
templates.env.globals["app_version"] = _settings.app_version
```
Displayed in header as `<span class="header-version">v{app_version}</span>` (right side, muted).
### Configurable Thresholds
`renderer.py` `_temp_class` now reads from settings instead of hardcoded values:
```python
if temp >= settings.temp_crit_c: return "temp-crit"
if temp >= settings.temp_warn_c: return "temp-warn"
```
`precheck` stage fails if `temperature_c >= settings.temp_crit_c`.
Surface validate fails if `bad_blocks > settings.bad_block_threshold` (default 0 = any bad sector = fail).
## Feature Reference (Stage 8)
### Live Terminal
A full PTY SSH terminal embedded in the log drawer as a fourth tab ("Terminal"). Requires SSH to be configured in Settings.
**Architecture:**
```
Browser (xterm.js) ──WS binary──▶ /ws/terminal (FastAPI WebSocket)
terminal.py handle()
asyncssh.connect() → create_process(term_type="xterm-256color")
asyncio tasks: ssh_to_ws() + ws_to_ssh()
```
**Message protocol** (client ↔ server):
- Client → server **binary**: raw keyboard input bytes forwarded to SSH stdin
- Client → server **text**: JSON control message — only `{"type":"resize","cols":N,"rows":N}` used currently
- Server → client **binary**: raw terminal output bytes from SSH stdout
**`app/terminal.py`** — `handle(ws)`:
1. Guard: `ssh_host` must be set; key or password must be present
2. `asyncssh.connect(known_hosts=None)` with key loaded via `import_private_key()` (never written to disk)
3. `conn.create_process(term_type="xterm-256color", term_size=(80,24), encoding=None)` — opens shell PTY
4. Two asyncio tasks bridging the streams; `asyncio.wait(FIRST_COMPLETED)` + cancel pending on disconnect
5. ANSI-formatted status messages for connect/error states
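The binary/text frame dispatch from the protocol above can be sketched as a pure helper (a sketch; `parse_ws_message` and the `(kind, payload)` return shape are hypothetical names, not the actual `terminal.py` code):

```python
import json

def parse_ws_message(msg):
    """Dispatch one WebSocket frame per the terminal protocol.

    Binary frames are raw SSH stdin bytes; text frames are JSON control
    messages, of which only 'resize' is currently used.
    """
    if isinstance(msg, bytes):
        return ("stdin", msg)
    data = json.loads(msg)
    if data.get("type") == "resize":
        return ("resize", (int(data["cols"]), int(data["rows"])))
    return ("ignore", None)
```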
**Frontend (app.js):**
- xterm.js 5.3.0 + xterm-addon-fit 0.8.0 loaded **lazily** on first Terminal tab click (CDN, ~300KB — not loaded until needed)
- `_termInit()` creates Terminal + FitAddon, opens into the panel div, registers `onData` once
- `ResizeObserver` on the panel → `fit()` + sends `resize` JSON to server
- `_termConnect()` called on init and by Reconnect button — guards against double-connect with `readyState <= 1` check
- `onData` always writes to current `_termWs` by reference — multiple reconnects don't add duplicate handlers
- Reconnect bar floats over terminal on `ws.onclose`; removed on `ws.onopen`
**Tab lifecycle:**
- Terminal tab click → `openTerminalTab()`: loads libs → `_termInit()` → `_termConnect()` on first open; just refits on subsequent opens
- Autoscroll label hidden when terminal tab is active (not applicable)
- WebSocket stays alive when drawer closes — shell persists until page unload or explicit disconnect
**New route:**
| Method | Path | Description |
|--------|------|-------------|
| `WS` | `/ws/terminal` | asyncssh PTY bridge |
**Config used:** `ssh_host`, `ssh_port`, `ssh_user`, `ssh_key`, `ssh_password` — same SSH settings as burn-in stages.
**xterm.js theme:** GitHub Dark color palette (matches app dark theme). `scrollback: 2000`. Font: SF Mono / Fira Code / Consolas.
### Cutting to Real TrueNAS (Next Steps)
When ready to test against a real TrueNAS CORE box:
1. In `.env` on maple.local, set:
```env
TRUENAS_BASE_URL=https://10.0.0.203 # or whatever your TrueNAS IP is
TRUENAS_API_KEY=your-real-key-here
TRUENAS_VERIFY_TLS=false # unless you have a valid cert
```
2. Comment out `mock-truenas` service in `docker-compose.yml` (or leave it running — harmless)
3. Verify TrueNAS CORE v2.0 API contract matches what `truenas.py` expects:
1. In Settings (or `.env`), set:
- **TrueNAS URL** → `https://10.0.0.X` (real IP)
- **API Key** → real API key
- **SSH Host** → same IP as TrueNAS
- **SSH User** → `root` (or sudoer with smartctl/badblocks access)
- **SSH Key** → paste PEM key into textarea
2. Click **Test SSH Connection** to verify before starting a burn-in
3. TrueNAS CORE uses `ada0`, `da0` device names (not `sda`). Mock drive names will differ.
4. Delete `app.db` before first real poll to clear mock drive rows
5. Comment out `mock-truenas` service in `docker-compose.yml` (optional — harmless to leave)
6. Verify TrueNAS CORE v2.0 REST API:
- `GET /api/v2.0/disk` returns list with `name`, `serial`, `model`, `size`, `temperature`
- `GET /api/v2.0/core/get_jobs` with filter `[["method","=","smart.test"]]`
- `POST /api/v2.0/smart/test` accepts `{disks: [devname], type: "SHORT"|"LONG"}`
4. Check that disk names match expected format (TrueNAS CORE uses `ada0`, `da0`, etc. — not `sda`)
- You may need to update mock drive names back or adjust poller logic
5. Delete `app.db` to clear mock drive rows before first real poll
---


@ -1,6 +1,6 @@
# TrueNAS Burn-In — Project Specification
**Version:** 0.5.0
**Version:** 1.0.0-8
**Status:** Active Development
**Audience:** Public / Open Source
@ -49,7 +49,7 @@ badblocks -wsv -b 4096 -p 1 /dev/sdX
```
This is a **destructive write test**. The UI must display a prominent warning before this stage begins, and again in the Settings page where the behavior is documented. The `-w` flag overwrites all data on the drive. This is intentional — these are new drives being validated before pool use.
**Failure threshold:** 2 or more bad blocks found triggers immediate abort and FAILED status. The threshold should be configurable in Settings (default: 2).
**Failure threshold:** Any bad blocks found triggers immediate abort and FAILED status by default. The threshold is configurable in Settings (`Bad Block Threshold`, default: 0 — meaning any bad sector = fail).
---
@ -97,10 +97,11 @@ A **Reset** action clears the test state for a drive so it can be re-queued. It
Slides up from the bottom of the page when a drive row is clicked. Does not navigate away — the table remains visible and scrollable above.
Three tabs:
- **badblocks** — live tail of badblocks stdout, including error lines with sector numbers highlighted in red.
- **SMART** — output of the last smartctl run for this drive, with monitored attribute values highlighted.
- **Events** — chronological timeline of everything that happened to this drive (test started, test passed, failure detected, alert sent, etc.).
Four tabs:
- **Burn-In** — stage-by-stage progress for the latest burn-in job; shows live elapsed time, raw SSH log output (smartctl / badblocks), and bad block count.
- **SMART** — output of the last smartctl run for this drive, with monitored attribute values highlighted (green/yellow/red). Raw `smartctl -a` output also shown when SSH mode is active.
- **Events** — chronological timeline of everything that happened to this drive (test started, test passed, failure detected, alert sent, reset, etc.).
- **Terminal** — live SSH PTY session (xterm.js). Opens an interactive shell on the TrueNAS host. Requires SSH to be configured in Settings. Supports full colour, resize, paste, and reconnect. xterm.js is loaded lazily on first use.
Features:
- Auto-scroll toggle (on by default).
@ -233,6 +234,7 @@ Key endpoints:
- `POST /api/v1/burnin/start` — start a burn-in job.
- `POST /api/v1/burnin/{job_id}/cancel` — cancel a burn-in job.
- `GET /sse/drives` — Server-Sent Events stream powering the real-time dashboard UI.
- `WS /ws/terminal` — WebSocket endpoint bridging xterm.js to an asyncssh PTY on TrueNAS.
- `GET /health` — health check endpoint.
The API makes this app a strong candidate for MCP server integration, allowing an AI assistant to query drive status, start tests, or receive alerts conversationally.
@ -280,9 +282,8 @@ To validate against real hardware:
## Version
- App version starts at **0.5.0**
- Displayed on the dashboard landing page header and in Settings.
- Update check in Settings queries GitHub releases API.
- App version: **1.0.0-8** (displayed in header next to the title, and in Settings).
- Update check in Settings queries Forgejo releases API (`git.hellocomputer.xyz`).
- API version tracked separately, currently **0.1.0**.
---


@ -206,10 +206,45 @@ async def cancel_job(job_id: int, operator: str) -> bool:
# Job runner
# ---------------------------------------------------------------------------
async def _thermal_gate_ok() -> bool:
"""True if it's thermally safe to start a new burn-in.
Checks the peak temperature of drives currently under active burn-in.
"""
try:
async with _db() as db:
cur = await db.execute("""
SELECT MAX(d.temperature_c)
FROM drives d
JOIN burnin_jobs bj ON bj.drive_id = d.id
WHERE bj.state = 'running' AND d.temperature_c IS NOT NULL
""")
row = await cur.fetchone()
max_temp = row[0] if row and row[0] is not None else None
return max_temp is None or max_temp < settings.temp_warn_c
except Exception:
return True # Never block on error
async def _run_job(job_id: int) -> None:
"""Acquire semaphore slot, execute all stages, persist final state."""
assert _semaphore is not None, "burnin.init() not called"
# Adaptive thermal gate: wait before competing for a slot if running drives
# are already at or above the warning threshold. This prevents layering a
# new burn-in on top of a thermally-stressed system. Gives up after 3 min
# and proceeds anyway so jobs don't queue indefinitely.
for _attempt in range(18): # 18 × 10 s = 3 min max
if await _thermal_gate_ok():
break
if _attempt == 0:
log.info(
"Thermal gate: job %d waiting — running drive temps at or above %d°C",
job_id, settings.temp_warn_c,
)
await asyncio.sleep(10)
else:
log.warning("Thermal gate timed out for job %d — proceeding anyway", job_id)
async with _semaphore:
if await _is_cancelled(job_id):
return
@ -303,6 +338,16 @@ async def _run_job(job_id: int) -> None:
)
job_row = await cur2.fetchone()
if job_row:
# Get bad_blocks count from surface_validate stage if present
bad_blocks = 0
async with _db() as db3:
cur3 = await db3.execute(
"SELECT bad_blocks FROM burnin_stages WHERE burnin_job_id=? AND stage_name='surface_validate'",
(job_id,)
)
bb_row = await cur3.fetchone()
if bb_row and bb_row[0]:
bad_blocks = bb_row[0]
asyncio.create_task(notifier.notify_job_complete(
job_id=job_id,
devname=devname,
@ -312,6 +357,7 @@ async def _run_job(job_id: int) -> None:
profile=job_row["profile"],
operator=job_row["operator"],
error_text=error_text,
bad_blocks=bad_blocks,
))
except Exception as exc:
log.error("Failed to schedule notifications: %s", exc)
@ -339,7 +385,7 @@ async def _execute_stages(job_id: int, stages: list[str], devname: str, drive_id
await _cancel_stage(job_id, stage_name)
else:
await _finish_stage(job_id, stage_name, success=ok)
await _recalculate_progress(job_id, profile)
await _recalculate_progress(job_id)
_push_update()
if not ok:
@ -352,15 +398,15 @@ async def _dispatch_stage(job_id: int, stage_name: str, devname: str, drive_id:
if stage_name == "precheck":
return await _stage_precheck(job_id, drive_id)
elif stage_name == "short_smart":
return await _stage_smart_test(job_id, devname, "SHORT", "short_smart")
return await _stage_smart_test(job_id, devname, "SHORT", "short_smart", drive_id)
elif stage_name == "long_smart":
return await _stage_smart_test(job_id, devname, "LONG", "long_smart")
return await _stage_smart_test(job_id, devname, "LONG", "long_smart", drive_id)
elif stage_name == "surface_validate":
return await _stage_timed_simulate(job_id, "surface_validate", settings.surface_validate_seconds)
return await _stage_surface_validate(job_id, devname, drive_id)
elif stage_name == "io_validate":
return await _stage_timed_simulate(job_id, "io_validate", settings.io_validate_seconds)
elif stage_name == "final_check":
return await _stage_final_check(job_id, devname)
return await _stage_final_check(job_id, devname, drive_id)
return True
@ -385,16 +431,25 @@ async def _stage_precheck(job_id: int, drive_id: int) -> bool:
await _set_stage_error(job_id, "precheck", "Drive SMART health is FAILED — refusing to burn in")
return False
if temp and temp > 60:
await _set_stage_error(job_id, "precheck", f"Drive temperature {temp}°C exceeds 60°C limit")
if temp and temp > settings.temp_crit_c:
await _set_stage_error(job_id, "precheck", f"Drive temperature {temp}°C exceeds {settings.temp_crit_c}°C limit")
return False
await asyncio.sleep(1) # Simulate brief check
return True
async def _stage_smart_test(job_id: int, devname: str, test_type: str, stage_name: str) -> bool:
"""Start a TrueNAS SMART test and poll until complete."""
async def _stage_smart_test(job_id: int, devname: str, test_type: str, stage_name: str,
drive_id: int | None = None) -> bool:
"""Start a SMART test. Uses SSH if configured, TrueNAS REST API otherwise."""
from app import ssh_client
if ssh_client.is_configured():
return await _stage_smart_test_ssh(job_id, devname, test_type, stage_name, drive_id)
return await _stage_smart_test_api(job_id, devname, test_type, stage_name)
async def _stage_smart_test_api(job_id: int, devname: str, test_type: str, stage_name: str) -> bool:
"""TrueNAS REST API path for SMART test (mock / dev mode)."""
tn_job_id = await _client.start_smart_test([devname], test_type)
while True:
@ -428,8 +483,349 @@ async def _stage_smart_test(job_id: int, devname: str, test_type: str, stage_nam
await asyncio.sleep(POLL_INTERVAL)
async def _stage_smart_test_ssh(job_id: int, devname: str, test_type: str, stage_name: str,
drive_id: int | None) -> bool:
"""SSH path for SMART test — runs smartctl directly on TrueNAS."""
from app import ssh_client
# Start the test
try:
startup = await ssh_client.start_smart_test(devname, test_type)
await _append_stage_log(job_id, stage_name, startup + "\n")
except Exception as exc:
await _set_stage_error(job_id, stage_name, f"Failed to start SMART test via SSH: {exc}")
return False
# Brief pause to let the test register in smartctl output
await asyncio.sleep(3)
# Poll until complete
while True:
if await _is_cancelled(job_id):
try:
await ssh_client.abort_smart_test(devname)
except Exception:
pass
return False
await asyncio.sleep(POLL_INTERVAL)
try:
progress = await ssh_client.poll_smart_progress(devname)
except Exception as exc:
log.warning("SSH SMART poll failed: %s", exc, extra={"job_id": job_id})
await _append_stage_log(job_id, stage_name, f"[poll error] {exc}\n")
continue
await _append_stage_log(job_id, stage_name, progress["output"] + "\n---\n")
if progress["state"] == "running":
pct = max(0, 100 - progress["percent_remaining"])
await _update_stage_percent(job_id, stage_name, pct)
await _recalculate_progress(job_id)
_push_update()
elif progress["state"] == "passed":
await _update_stage_percent(job_id, stage_name, 100)
# Run attribute check
if drive_id is not None:
try:
attrs = await ssh_client.get_smart_attributes(devname)
await _store_smart_attrs(drive_id, attrs)
await _store_smart_raw_output(drive_id, test_type, attrs["raw_output"])
if attrs["failures"]:
error = "SMART attribute failures: " + "; ".join(attrs["failures"])
await _set_stage_error(job_id, stage_name, error)
return False
if attrs["warnings"]:
await _append_stage_log(
job_id, stage_name,
"[WARNING] " + "; ".join(attrs["warnings"]) + "\n"
)
except Exception as exc:
log.warning("Failed to retrieve SMART attributes: %s", exc)
await _recalculate_progress(job_id)
_push_update()
return True
elif progress["state"] == "failed":
await _set_stage_error(job_id, stage_name, f"SMART {test_type} test failed")
return False
# "unknown" → keep polling
async def _badblocks_available() -> bool:
"""Check if badblocks is installed on the remote host (Linux/SCALE only)."""
from app import ssh_client
try:
async with await ssh_client._connect() as conn:
result = await conn.run("which badblocks", check=False)
return result.returncode == 0
except Exception:
return False
async def _stage_surface_validate(job_id: int, devname: str, drive_id: int) -> bool:
"""
Surface validation stage auto-routes to the right implementation:
1. SSH configured + badblocks available (TrueNAS SCALE / Linux):
runs badblocks -wsv -b 4096 -p 1 /dev/{devname} directly over SSH.
2. SSH configured + badblocks NOT available (TrueNAS CORE / FreeBSD):
uses TrueNAS REST API disk.wipe FULL job + post-wipe SMART check.
3. No SSH:
simulated timed progress (dev/mock mode).
"""
from app import ssh_client
if ssh_client.is_configured():
if await _badblocks_available():
return await _stage_surface_validate_ssh(job_id, devname, drive_id)
# TrueNAS CORE/FreeBSD: badblocks not available — use native wipe API
await _append_stage_log(
job_id, "surface_validate",
"[INFO] badblocks not found on host (TrueNAS CORE/FreeBSD) — "
"using TrueNAS disk.wipe API (FULL write pass).\n\n"
)
return await _stage_surface_validate_truenas(job_id, devname, drive_id)
return await _stage_timed_simulate(job_id, "surface_validate", settings.surface_validate_seconds)
async def _stage_surface_validate_ssh(job_id: int, devname: str, drive_id: int) -> bool:
"""Run badblocks over SSH, streaming output to stage log."""
from app import ssh_client
await _append_stage_log(
job_id, "surface_validate",
f"[START] badblocks -wsv -b 4096 -p 1 /dev/{devname}\n"
f"[NOTE] This is a DESTRUCTIVE write test. All data on /dev/{devname} will be overwritten.\n\n"
)
# Stream badblocks output directly over SSH: the drain coroutines below
# flush log chunks to the DB, update progress / bad-block counts, and
# check for cancellation inline.
result = {"bad_blocks": 0, "output": "", "aborted": False}
try:
# Drain stdout and stderr concurrently; stderr carries the "% done" progress lines
bad_blocks_total = 0
output_lines: list[str] = []
async with await ssh_client._connect() as conn:
cmd = f"badblocks -wsv -b 4096 -p 1 /dev/{devname}"
async with conn.create_process(cmd) as proc:
import re as _re
async def _drain(stream, is_stderr: bool):
nonlocal bad_blocks_total
async for raw in stream:
line = raw if isinstance(raw, str) else raw.decode("utf-8", errors="replace")
output_lines.append(line)
if is_stderr:
m = _re.search(r"([\d.]+)%\s+done", line)
if m:
pct = min(99, int(float(m.group(1))))
await _update_stage_percent(job_id, "surface_validate", pct)
await _update_stage_bad_blocks(job_id, "surface_validate", bad_blocks_total)
await _recalculate_progress(job_id)
_push_update()
else:
stripped = line.strip()
if stripped and stripped.isdigit():
bad_blocks_total += 1
# Append to DB log in chunks
if len(output_lines) % 20 == 0:
chunk = "".join(output_lines[-20:])
await _append_stage_log(job_id, "surface_validate", chunk)
# Abort on bad block threshold
if bad_blocks_total > settings.bad_block_threshold:
proc.kill()
output_lines.append(
f"\n[ABORTED] {bad_blocks_total} bad block(s) exceeded "
f"threshold ({settings.bad_block_threshold})\n"
)
return
if await _is_cancelled(job_id):
proc.kill()
return
await asyncio.gather(
_drain(proc.stdout, False),
_drain(proc.stderr, True),
return_exceptions=True,
)
await proc.wait()
# Flush remaining output
remainder = "".join(output_lines)
await _append_stage_log(job_id, "surface_validate", remainder)
result["bad_blocks"] = bad_blocks_total
result["output"] = remainder
result["aborted"] = bad_blocks_total > settings.bad_block_threshold
except asyncio.CancelledError:
return False
except Exception as exc:
await _append_stage_log(job_id, "surface_validate", f"\n[SSH error] {exc}\n")
await _set_stage_error(job_id, "surface_validate", f"SSH badblocks error: {exc}")
return False
await _update_stage_bad_blocks(job_id, "surface_validate", result["bad_blocks"])
if result["aborted"] or result["bad_blocks"] > settings.bad_block_threshold:
await _set_stage_error(
job_id, "surface_validate",
f"Surface validate FAILED: {result['bad_blocks']} bad block(s) found "
f"(threshold: {settings.bad_block_threshold})"
)
return False
return True
async def _stage_surface_validate_truenas(job_id: int, devname: str, drive_id: int) -> bool:
"""
Surface validation via TrueNAS CORE disk.wipe REST API.
Used on FreeBSD (TrueNAS CORE) where badblocks is unavailable.
Sends a FULL write-zero pass across the entire disk, polls progress,
then runs a post-wipe SMART attribute check to catch reallocated sectors.
"""
from app import ssh_client
await _append_stage_log(
job_id, "surface_validate",
f"[START] TrueNAS disk.wipe FULL — {devname}\n"
f"[NOTE] DESTRUCTIVE: all data on {devname} will be overwritten.\n\n"
)
# Start the wipe job
try:
tn_job_id = await _client.wipe_disk(devname, "FULL")
except Exception as exc:
await _set_stage_error(job_id, "surface_validate", f"Failed to start disk.wipe: {exc}")
return False
await _append_stage_log(
job_id, "surface_validate",
f"[JOB] TrueNAS wipe job started (job_id={tn_job_id})\n"
)
# Poll until complete
log_flush_counter = 0
while True:
if await _is_cancelled(job_id):
try:
await _client.abort_job(tn_job_id)
except Exception:
pass
return False
await asyncio.sleep(POLL_INTERVAL)
try:
job = await _client.get_job(tn_job_id)
except Exception as exc:
log.warning("Wipe job poll failed: %s", exc, extra={"job_id": job_id})
await _append_stage_log(job_id, "surface_validate", f"[poll error] {exc}\n")
continue
if not job:
await _set_stage_error(job_id, "surface_validate", f"Wipe job {tn_job_id} not found")
return False
state = job.get("state", "")
pct = int(job.get("progress", {}).get("percent", 0) or 0)
desc = job.get("progress", {}).get("description", "")
await _update_stage_percent(job_id, "surface_validate", min(pct, 99))
await _recalculate_progress(job_id)
_push_update()
# Log progress description every ~5 polls to avoid DB spam
log_flush_counter += 1
if desc and log_flush_counter % 5 == 0:
await _append_stage_log(job_id, "surface_validate", f"[{pct}%] {desc}\n")
if state == "SUCCESS":
await _update_stage_percent(job_id, "surface_validate", 100)
await _append_stage_log(
job_id, "surface_validate",
f"\n[DONE] Wipe job {tn_job_id} completed successfully.\n"
)
# Post-wipe SMART check — catch any sectors that failed under write stress
if ssh_client.is_configured() and drive_id is not None:
await _append_stage_log(
job_id, "surface_validate",
"[CHECK] Running post-wipe SMART attribute check...\n"
)
try:
attrs = await ssh_client.get_smart_attributes(devname)
await _store_smart_attrs(drive_id, attrs)
if attrs["failures"]:
error = "Post-wipe SMART check: " + "; ".join(attrs["failures"])
await _set_stage_error(job_id, "surface_validate", error)
return False
if attrs["warnings"]:
await _append_stage_log(
job_id, "surface_validate",
"[WARNING] " + "; ".join(attrs["warnings"]) + "\n"
)
await _append_stage_log(
job_id, "surface_validate",
f"[CHECK] SMART health: {attrs['health']} — no critical attributes.\n"
)
except Exception as exc:
log.warning("Post-wipe SMART check failed: %s", exc)
await _append_stage_log(
job_id, "surface_validate",
f"[WARN] Post-wipe SMART check failed (non-fatal): {exc}\n"
)
return True
elif state in ("FAILED", "ABORTED", "ERROR"):
error_msg = job.get("error") or f"Disk wipe failed (state={state})"
await _set_stage_error(
job_id, "surface_validate",
f"TrueNAS disk.wipe FAILED: {error_msg}"
)
return False
# RUNNING or WAITING — keep polling
async def _stage_timed_simulate(job_id: int, stage_name: str, duration_seconds: int) -> bool:
"""Simulate a timed stage (surface validation / IO validation) with progress updates."""
"""Simulate a timed stage with progress updates (mock / dev mode)."""
start = time.monotonic()
while True:
@ -449,9 +845,28 @@ async def _stage_timed_simulate(job_id: int, stage_name: str, duration_seconds:
await asyncio.sleep(POLL_INTERVAL)
async def _stage_final_check(job_id: int, devname: str, drive_id: int | None = None) -> bool:
"""
Verify drive passed all tests.
SSH mode: run smartctl -a and check critical attributes.
Mock mode: check SMART health field in DB.
"""
await asyncio.sleep(1)
from app import ssh_client
if ssh_client.is_configured() and drive_id is not None:
try:
attrs = await ssh_client.get_smart_attributes(devname)
await _store_smart_attrs(drive_id, attrs)
if attrs["health"] == "FAILED" or attrs["failures"]:
failures = attrs["failures"] or [f"SMART health: {attrs['health']}"]
await _set_stage_error(job_id, "final_check",
"Final check failed: " + "; ".join(failures))
return False
return True
except Exception as exc:
log.warning("SSH final_check failed, falling back to DB check: %s", exc)
# DB check (mock mode fallback)
async with _db() as db:
cur = await db.execute(
"SELECT smart_health FROM drives WHERE devname=?", (devname,)
@ -549,6 +964,57 @@ async def _cancel_stage(job_id: int, stage_name: str) -> None:
await db.commit()
async def _append_stage_log(job_id: int, stage_name: str, text: str) -> None:
"""Append text to the log_text column of a burnin_stages row."""
async with _db() as db:
await db.execute("PRAGMA journal_mode=WAL")
await db.execute(
"""UPDATE burnin_stages
SET log_text = COALESCE(log_text, '') || ?
WHERE burnin_job_id=? AND stage_name=?""",
(text, job_id, stage_name),
)
await db.commit()
async def _update_stage_bad_blocks(job_id: int, stage_name: str, count: int) -> None:
async with _db() as db:
await db.execute("PRAGMA journal_mode=WAL")
await db.execute(
"UPDATE burnin_stages SET bad_blocks=? WHERE burnin_job_id=? AND stage_name=?",
(count, job_id, stage_name),
)
await db.commit()
async def _store_smart_attrs(drive_id: int, attrs: dict) -> None:
"""Persist latest SMART attribute dict to drives.smart_attrs (JSON)."""
import json
# Convert int keys to str for JSON serialisation
serialisable = {str(k): v for k, v in attrs.get("attributes", {}).items()}
blob = json.dumps({
"health": attrs.get("health", "UNKNOWN"),
"attrs": serialisable,
"warnings": attrs.get("warnings", []),
"failures": attrs.get("failures", []),
})
async with _db() as db:
await db.execute("PRAGMA journal_mode=WAL")
await db.execute("UPDATE drives SET smart_attrs=? WHERE id=?", (blob, drive_id))
await db.commit()
async def _store_smart_raw_output(drive_id: int, test_type: str, raw: str) -> None:
"""Store raw smartctl output in smart_tests.raw_output."""
async with _db() as db:
await db.execute("PRAGMA journal_mode=WAL")
await db.execute(
"UPDATE smart_tests SET raw_output=? WHERE drive_id=? AND test_type=?",
(raw, drive_id, test_type.lower()),
)
await db.commit()
async def _set_stage_error(job_id: int, stage_name: str, error_text: str) -> None:
async with _db() as db:
await db.execute("PRAGMA journal_mode=WAL")

View file

@ -51,5 +51,24 @@ class Settings(BaseSettings):
# Stuck-job detection: jobs running longer than this are marked 'unknown'
stuck_job_hours: int = 24
# Temperature thresholds (°C) — drives table colouring + precheck gate
temp_warn_c: int = 46 # orange warning
temp_crit_c: int = 55 # red critical (precheck refuses to start above this)
# Bad-block tolerance — surface_validate fails if bad blocks exceed this
bad_block_threshold: int = 0
# SSH credentials for direct TrueNAS command execution (Stage 7)
# When ssh_host is set, burn-in stages use SSH for smartctl/badblocks instead of REST API.
# Leave ssh_host empty to use the mock/REST API (development mode).
ssh_host: str = ""
ssh_port: int = 22
ssh_user: str = "root" # TrueNAS CORE default is root
ssh_password: str = "" # Password auth (leave blank if using key)
ssh_key: str = "" # PEM private key content (paste full key including headers)
# Application version — used by the /api/v1/updates/check endpoint
app_version: str = "1.0.0-7"
settings = Settings()

View file

@ -82,6 +82,13 @@ CREATE INDEX IF NOT EXISTS idx_audit_events_job ON audit_events(burnin_job_id)
_MIGRATIONS = [
"ALTER TABLE drives ADD COLUMN notes TEXT",
"ALTER TABLE drives ADD COLUMN location TEXT",
# Stage 7: SSH command output + SMART attribute storage
"ALTER TABLE burnin_stages ADD COLUMN log_text TEXT",
"ALTER TABLE burnin_stages ADD COLUMN bad_blocks INTEGER DEFAULT 0",
"ALTER TABLE drives ADD COLUMN smart_attrs TEXT",
"ALTER TABLE smart_tests ADD COLUMN raw_output TEXT",
# Stage 8: track last reset time so dashboard burn-in col clears after reset
"ALTER TABLE drives ADD COLUMN last_reset_at TEXT",
]
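Each entry in `_MIGRATIONS` is a bare `ALTER TABLE`, so re-running it against an already-migrated database raises `duplicate column name`. A common pattern for applying such a list idempotently is to attempt each statement and swallow exactly that error; this is a sketch under that assumption, not the project's actual migration runner:

```python
import sqlite3

def apply_migrations(conn: sqlite3.Connection, migrations: list[str]) -> int:
    """Run add-column migrations idempotently; return how many applied."""
    applied = 0
    for stmt in migrations:
        try:
            conn.execute(stmt)
            applied += 1
        except sqlite3.OperationalError as exc:
            # Column already exists from a previous run -- safe to skip.
            if "duplicate column name" not in str(exc):
                raise
    conn.commit()
    return applied
```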

View file

@ -23,8 +23,10 @@ async def notify_job_complete(
profile: str,
operator: str,
error_text: str | None,
bad_blocks: int = 0,
) -> None:
"""Fire all configured notifications for a completed burn-in job."""
from datetime import datetime, timezone
tasks = []
if settings.webhook_url:
@ -38,6 +40,8 @@ async def notify_job_complete(
"profile": profile,
"operator": operator,
"error_text": error_text,
"bad_blocks": bad_blocks,
"timestamp": datetime.now(timezone.utc).isoformat(),
}))
if settings.smtp_host:

View file

@ -20,13 +20,15 @@ from app.truenas import TrueNASClient
log = logging.getLogger(__name__)
# Shared state read by the /health endpoint and dashboard template
_state: dict[str, Any] = {
"last_poll_at": None,
"last_error": None,
"healthy": False,
"drives_seen": 0,
"consecutive_failures": 0,
"system_temps": {}, # {"cpu_c": int|None, "pch_c": int|None}
"thermal_pressure": "ok", # "ok" | "warn" | "crit" — based on running burn-in drive temps
}
# SSE subscriber queues — notified after each successful poll
@ -208,6 +210,67 @@ async def _sync_history(
# Poll cycle
# ---------------------------------------------------------------------------
async def _poll_smart_via_ssh(db: aiosqlite.Connection, now: str) -> None:
"""
Poll progress for SMART tests started via SSH (truenas_job_id IS NULL).
Used on TrueNAS SCALE 25.10+ where the REST smart/test API no longer exists.
"""
from app import ssh_client
if not ssh_client.is_configured():
return
cur = await db.execute(
"""SELECT st.id, st.test_type, st.drive_id, d.devname, st.started_at
FROM smart_tests st
JOIN drives d ON d.id = st.drive_id
WHERE st.state = 'running' AND st.truenas_job_id IS NULL"""
)
rows = await cur.fetchall()
if not rows:
return
for row in rows:
test_id, ttype, drive_id, devname, started_at = row[0], row[1], row[2], row[3], row[4]
try:
progress = await ssh_client.poll_smart_progress(devname)
except Exception as exc:
log.warning("SSH SMART poll failed for %s: %s", devname, exc)
continue
state = progress["state"]
pct_remaining = progress.get("percent_remaining") # None = not yet in output
raw_output = progress.get("output", "")
if state == "running":
# pct_remaining=None means smartctl output doesn't have the % line yet
# (test just started) — keep percent at 0 rather than jumping to 100
if pct_remaining is None:
pct = 0
else:
pct = max(0, 100 - pct_remaining)
eta = _eta_from_progress(pct, started_at)
await db.execute(
"UPDATE smart_tests SET percent=?, eta_at=?, raw_output=? WHERE id=?",
(pct, eta, raw_output, test_id),
)
elif state == "passed":
await db.execute(
"UPDATE smart_tests SET state='passed', percent=100, finished_at=?, raw_output=? WHERE id=?",
(now, raw_output, test_id),
)
log.info("SSH SMART %s passed on %s", ttype, devname)
elif state == "failed":
await db.execute(
"UPDATE smart_tests SET state='failed', percent=0, finished_at=?, "
"error_text=?, raw_output=? WHERE id=?",
(now, f"SMART {ttype.upper()} test failed", raw_output, test_id),
)
log.warning("SSH SMART %s FAILED on %s", ttype, devname)
# state == "unknown" → keep polling, no update
await db.commit()
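`ssh_client.poll_smart_progress` (not shown in this hunk) derives `state` and `percent_remaining` from `smartctl` output; the `percent_remaining=None` case corresponds to output that lacks the remaining-percent line. A hedged sketch of that parsing, assuming smartctl's usual self-test status wording (the real parser may differ):

```python
import re

def parse_selftest_progress(output: str) -> dict:
    """Extract self-test state + percent remaining from smartctl text.

    Returns {"state": "running"|"passed"|"failed"|"unknown",
             "percent_remaining": int | None}.
    """
    m = re.search(r"(\d+)%\s+of\s+test\s+remaining", output)
    remaining = int(m.group(1)) if m else None
    low = output.lower()
    if "self-test routine in progress" in low:
        # remaining may still be None right after the test starts
        return {"state": "running", "percent_remaining": remaining}
    if "completed without error" in low:
        return {"state": "passed", "percent_remaining": 0}
    if "read failure" in low or "failed" in low:
        return {"state": "failed", "percent_remaining": remaining}
    return {"state": "unknown", "percent_remaining": remaining}
```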
async def poll_cycle(client: TrueNASClient) -> int:
"""Run one full poll. Returns number of drives seen."""
now = _now()
@ -215,6 +278,20 @@ async def poll_cycle(client: TrueNASClient) -> int:
disks = await client.get_disks()
running_jobs = await client.get_smart_jobs(state="RUNNING")
# Fetch temperatures via SCALE-specific endpoint.
# CORE doesn't have this endpoint — silently skip on any error.
try:
temps = await client.get_disk_temperatures()
except Exception:
temps = {}
# Inject temperature into each disk dict (SCALE 25.10 has no temp in /disk)
for disk in disks:
devname = disk.get("devname", "")
t = temps.get(devname)
if t is not None:
disk["temperature"] = int(round(t))
# Index running jobs by (devname, test_type)
active: dict[tuple[str, str], dict] = {}
for job in running_jobs:
@ -243,6 +320,9 @@ async def poll_cycle(client: TrueNASClient) -> int:
await db.commit()
# SSH SMART polling — for tests started via smartctl (no TrueNAS REST job)
await _poll_smart_via_ssh(db, now)
return len(disks)
@ -263,6 +343,39 @@ async def run(client: TrueNASClient) -> None:
_state["drives_seen"] = count
_state["consecutive_failures"] = 0
log.debug("Poll OK", extra={"drives": count})
# System sensor temps via SSH (non-fatal)
from app import ssh_client as _ssh
if _ssh.is_configured():
try:
_state["system_temps"] = await _ssh.get_system_sensors()
except Exception:
pass
# Thermal pressure: max temp of drives currently under burn-in
try:
async with aiosqlite.connect(settings.db_path) as _tdb:
_tdb.row_factory = aiosqlite.Row
await _tdb.execute("PRAGMA journal_mode=WAL")
_cur = await _tdb.execute("""
SELECT MAX(d.temperature_c)
FROM drives d
JOIN burnin_jobs bj ON bj.drive_id = d.id
WHERE bj.state = 'running' AND d.temperature_c IS NOT NULL
""")
_row = await _cur.fetchone()
_max_t = _row[0] if _row and _row[0] is not None else None
if _max_t is None:
_state["thermal_pressure"] = "ok"
elif _max_t >= settings.temp_crit_c:
_state["thermal_pressure"] = "crit"
elif _max_t >= settings.temp_warn_c:
_state["thermal_pressure"] = "warn"
else:
_state["thermal_pressure"] = "ok"
except Exception:
_state["thermal_pressure"] = "ok"
_notify_subscribers()
# Check for stuck jobs every 5 cycles (~1 min at default 12s interval)

View file

@ -37,9 +37,10 @@ def _format_eta(seconds: int | None) -> str:
def _temp_class(celsius: int | None) -> str:
if celsius is None:
return ""
from app.config import settings
if celsius < settings.temp_warn_c:
return "temp-cool"
if celsius < settings.temp_crit_c:
return "temp-warm"
return "temp-hot"
@ -125,7 +126,7 @@ def _format_elapsed(iso: str | None) -> str:
return ""
# Register filters
templates.env.filters["format_bytes"] = _format_bytes
templates.env.filters["format_eta"] = _format_eta
templates.env.filters["temp_class"] = _temp_class
@ -134,3 +135,7 @@ templates.env.filters["format_dt_full"] = _format_dt_full
templates.env.filters["format_duration"] = _format_duration
templates.env.filters["format_elapsed"] = _format_elapsed
templates.env.globals["drive_status"] = _drive_status
from app.config import settings as _settings
templates.env.globals["app_version"] = _settings.app_version

View file

@ -5,7 +5,7 @@ import json
from datetime import datetime, timezone
import aiosqlite
from fastapi import APIRouter, Depends, HTTPException, Query, Request, WebSocket
from fastapi.responses import HTMLResponse, StreamingResponse
from sse_starlette.sse import EventSourceResponse
@ -118,11 +118,17 @@ _DRIVES_QUERY = """
async def _fetch_burnin_by_drive(db: aiosqlite.Connection) -> dict[int, dict]:
"""Return latest burn-in job (any state) keyed by drive_id."""
"""Return latest burn-in job (any state) keyed by drive_id.
Jobs created before the drive's last_reset_at are excluded so the
dashboard burn-in column clears after a reset while history is preserved.
"""
cur = await db.execute("""
SELECT bj.*
FROM burnin_jobs bj
JOIN drives d ON d.id = bj.drive_id
WHERE bj.id IN (SELECT MAX(id) FROM burnin_jobs GROUP BY drive_id)
AND (d.last_reset_at IS NULL OR bj.created_at > d.last_reset_at)
""")
rows = await cur.fetchall()
return {r["drive_id"]: dict(r) for r in rows}
@ -212,6 +218,18 @@ async def sse_drives(request: Request):
yield {"event": "drives-update", "data": html}
# Push system sensor state so JS can update temp chips live
ps = poller.get_state()
yield {
"event": "system-sensors",
"data": json.dumps({
"system_temps": ps.get("system_temps", {}),
"thermal_pressure": ps.get("thermal_pressure", "ok"),
"temp_warn_c": settings.temp_warn_c,
"temp_crit_c": settings.temp_crit_c,
}),
}
# Push browser notification event if this was a job completion
if alert:
yield {"event": "job-alert", "data": json.dumps(alert)}
@ -249,6 +267,87 @@ async def list_drives(db: aiosqlite.Connection = Depends(get_db)):
return [_row_to_drive(r) for r in rows]
@router.get("/api/v1/drives/{drive_id}/drawer")
async def drive_drawer(drive_id: int, db: aiosqlite.Connection = Depends(get_db)):
"""Data for the log drawer — latest burn-in job + stages, SMART tests, audit events."""
cur = await db.execute(_DRIVES_QUERY.format(where="WHERE d.id = ?"), (drive_id,))
row = await cur.fetchone()
if not row:
raise HTTPException(status_code=404, detail="Drive not found")
drive = _row_to_drive(row)
# Latest burn-in job + its stages (include log_text and bad_blocks)
cur = await db.execute(
"SELECT * FROM burnin_jobs WHERE drive_id=? ORDER BY id DESC LIMIT 1",
(drive_id,),
)
job_row = await cur.fetchone()
burnin = None
if job_row:
job = dict(job_row)
cur = await db.execute(
"SELECT id, stage_name, state, percent, started_at, finished_at, "
"duration_seconds, error_text, log_text, bad_blocks "
"FROM burnin_stages WHERE burnin_job_id=? ORDER BY id",
(job_row["id"],),
)
job["stages"] = [dict(r) for r in await cur.fetchall()]
burnin = job
# SMART raw output from smart_tests table
cur = await db.execute(
"SELECT test_type, state, percent, started_at, finished_at, error_text, raw_output "
"FROM smart_tests WHERE drive_id=?",
(drive_id,),
)
smart_rows = {r["test_type"]: dict(r) for r in await cur.fetchall()}
# Cached SMART attributes (JSON blob on drives table)
import json as _json
smart_attrs = None
cur = await db.execute("SELECT smart_attrs FROM drives WHERE id=?", (drive_id,))
attrs_row = await cur.fetchone()
if attrs_row and attrs_row["smart_attrs"]:
try:
smart_attrs = _json.loads(attrs_row["smart_attrs"])
except Exception:
pass
# Last 50 audit events for this drive (newest first)
cur = await db.execute("""
SELECT id, event_type, operator, message, created_at
FROM audit_events
WHERE drive_id = ?
ORDER BY id DESC
LIMIT 50
""", (drive_id,))
events = [dict(r) for r in await cur.fetchall()]
def _smart_card(test_type: str) -> dict:
smart_obj = drive.smart_short if test_type == "short" else drive.smart_long
base = smart_obj.model_dump() if smart_obj else {}
row = smart_rows.get(test_type, {})
base["raw_output"] = row.get("raw_output")
return base
return {
"drive": {
"id": drive.id,
"devname": drive.devname,
"serial": drive.serial,
"model": drive.model,
"size_bytes": drive.size_bytes,
},
"burnin": burnin,
"smart": {
"short": _smart_card("short"),
"long": _smart_card("long"),
"attrs": smart_attrs,
},
"events": events,
}
@router.get("/api/v1/drives/{drive_id}", response_model=DriveResponse)
async def get_drive(drive_id: int, db: aiosqlite.Connection = Depends(get_db)):
cur = await db.execute(
@ -266,9 +365,13 @@ async def smart_start(
body: dict,
db: aiosqlite.Connection = Depends(get_db),
):
"""Start a standalone SHORT or LONG SMART test on a single drive."""
from app.truenas import TrueNASClient
from app import burnin as _burnin
"""Start a standalone SHORT or LONG SMART test on a single drive.
Uses SSH (smartctl) when configured required for TrueNAS SCALE 25.10+
where the REST smart/test endpoint no longer exists.
Falls back to TrueNAS REST API for older versions.
"""
from app import burnin as _burnin, ssh_client
test_type = (body.get("type") or "").upper()
if test_type not in ("SHORT", "LONG"):
@ -280,16 +383,41 @@ async def smart_start(
raise HTTPException(status_code=404, detail="Drive not found")
devname = row[0]
# Use the shared TrueNAS client held by the burnin module
now = datetime.now(timezone.utc).isoformat()
ttype_lower = test_type.lower()
if ssh_client.is_configured():
# SSH path — works on TrueNAS SCALE 25.10+ and CORE
try:
output = await ssh_client.start_smart_test(devname, test_type)
except Exception as exc:
raise HTTPException(status_code=502, detail=f"SSH error: {exc}")
# Mark as running in DB (truenas_job_id=NULL signals SSH-managed test)
# Store smartctl start output as proof the test was initiated
await db.execute(
"""INSERT INTO smart_tests (drive_id, test_type, state, percent, started_at, raw_output)
VALUES (?,?,?,?,?,?)
ON CONFLICT(drive_id, test_type) DO UPDATE SET
state='running', percent=0, truenas_job_id=NULL,
started_at=excluded.started_at, finished_at=NULL, error_text=NULL,
raw_output=excluded.raw_output""",
(drive_id, ttype_lower, "running", 0, now, output),
)
await db.commit()
from app import poller as _poller
_poller._notify_subscribers()
return {"devname": devname, "type": test_type, "message": output[:200]}
else:
# REST path — older TrueNAS CORE / SCALE versions
client = _burnin._client
if client is None:
raise HTTPException(status_code=503, detail="TrueNAS client not ready")
try:
tn_job_id = await client.start_smart_test([devname], test_type)
except Exception as exc:
raise HTTPException(status_code=502, detail=f"TrueNAS error: {exc}")
return {"job_id": tn_job_id, "devname": devname, "type": test_type}
@ -316,7 +444,16 @@ async def smart_cancel(
if client is None:
raise HTTPException(status_code=503, detail="TrueNAS client not ready")
# Find the running TrueNAS job for this drive/test-type
from app import ssh_client
if ssh_client.is_configured():
# SSH path — abort via smartctl -X
try:
await ssh_client.abort_smart_test(devname)
except Exception as exc:
raise HTTPException(status_code=502, detail=f"SSH abort error: {exc}")
else:
# REST path — find TrueNAS job and abort it
try:
jobs = await client.get_smart_jobs()
tn_job_id = None
@ -620,6 +757,57 @@ async def update_drive(
return {"updated": True}
@router.post("/api/v1/drives/{drive_id}/reset")
async def reset_drive(
drive_id: int,
body: dict,
db: aiosqlite.Connection = Depends(get_db),
):
"""
Clear SMART test results for a drive so it shows as fresh.
Only allowed when no burn-in job is active (queued or running).
Preserves all job history; only the display state is reset.
"""
cur = await db.execute("SELECT id FROM drives WHERE id=?", (drive_id,))
if not await cur.fetchone():
raise HTTPException(status_code=404, detail="Drive not found")
# Reject if any active burn-in
cur = await db.execute(
"SELECT COUNT(*) FROM burnin_jobs WHERE drive_id=? AND state IN ('queued','running')",
(drive_id,),
)
if (await cur.fetchone())[0] > 0:
raise HTTPException(status_code=409, detail="Cannot reset while a burn-in is active")
operator = body.get("operator", "operator")
# Reset SMART test state to idle
await db.execute(
"""UPDATE smart_tests SET state='idle', percent=0, started_at=NULL,
eta_at=NULL, finished_at=NULL, error_text=NULL, raw_output=NULL
WHERE drive_id=?""",
(drive_id,),
)
# Clear SMART attrs cache + stamp reset time (hides prior burn-in from dashboard)
now = datetime.now(timezone.utc).isoformat()
await db.execute(
"UPDATE drives SET smart_attrs=NULL, last_reset_at=? WHERE id=?",
(now, drive_id),
)
# Audit event
await db.execute(
"""INSERT INTO audit_events (event_type, drive_id, operator, message)
VALUES (?,?,?,?)""",
("drive_reset", drive_id, operator, "Drive reset — SMART state cleared"),
)
await db.commit()
poller._notify_subscribers()
return {"reset": True}
# ---------------------------------------------------------------------------
# Audit log page
# ---------------------------------------------------------------------------
@ -714,6 +902,36 @@ async def stats_page(
""")
by_day = [dict(r) for r in await cur.fetchall()]
# Average test duration by drive size (rounded to nearest TB)
cur = await db.execute("""
SELECT
CAST(ROUND(CAST(d.size_bytes AS REAL) / 1e12) AS INTEGER) AS size_tb,
COUNT(*) AS total,
ROUND(AVG(
(julianday(bj.finished_at) - julianday(bj.started_at)) * 86400 / 3600.0
), 1) AS avg_hours
FROM burnin_jobs bj
JOIN drives d ON d.id = bj.drive_id
WHERE bj.state IN ('passed', 'failed')
AND bj.started_at IS NOT NULL
AND bj.finished_at IS NOT NULL
GROUP BY size_tb
ORDER BY size_tb
""")
by_size = [dict(r) for r in await cur.fetchall()]
# Failure breakdown by stage (which stage caused the failure)
cur = await db.execute("""
SELECT
COALESCE(bj.stage_name, 'unknown') AS failed_stage,
COUNT(*) AS count
FROM burnin_jobs bj
WHERE bj.state = 'failed'
GROUP BY failed_stage
ORDER BY count DESC
""")
by_failure_stage = [dict(r) for r in await cur.fetchall()]
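The `julianday` arithmetic in the size query converts elapsed time to hours: `julianday` returns fractional days, so the difference times 86400 gives seconds, and dividing by 3600.0 gives hours. A standalone check against SQLite, with a hypothetical table and timestamps:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (started_at TEXT, finished_at TEXT)")
conn.execute(
    "INSERT INTO jobs VALUES ('2026-02-24T00:00:00', '2026-02-24T06:30:00')"
)
row = conn.execute(
    "SELECT ROUND((julianday(finished_at) - julianday(started_at)) "
    "* 86400 / 3600.0, 1) FROM jobs"
).fetchone()
hours = row[0]  # 6.5 hours elapsed
```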
# Drives tracked
cur = await db.execute("SELECT COUNT(*) FROM drives")
drives_total = (await cur.fetchone())[0]
@ -724,6 +942,8 @@ async def stats_page(
"overall": overall,
"by_model": by_model,
"by_day": by_day,
"by_size": by_size,
"by_failure_stage": by_failure_stage,
"drives_total": drives_total,
"poller": ps,
**_stale_context(ps),
@ -739,18 +959,9 @@ async def settings_page(
request: Request,
db: aiosqlite.Connection = Depends(get_db),
):
# Editable values — real values for form fields (secrets excluded)
editable = {
# SMTP
"smtp_host": settings.smtp_host,
"smtp_port": settings.smtp_port,
"smtp_ssl_mode": settings.smtp_ssl_mode or "starttls",
@ -762,17 +973,37 @@ async def settings_page(
"smtp_daily_report_enabled": settings.smtp_daily_report_enabled,
"smtp_alert_on_fail": settings.smtp_alert_on_fail,
"smtp_alert_on_pass": settings.smtp_alert_on_pass,
# Webhook
"webhook_url": settings.webhook_url,
# Burn-in behaviour
"stuck_job_hours": settings.stuck_job_hours,
"max_parallel_burnins": settings.max_parallel_burnins,
"temp_warn_c": settings.temp_warn_c,
"temp_crit_c": settings.temp_crit_c,
"bad_block_threshold": settings.bad_block_threshold,
# SSH credentials (take effect immediately — each SSH call reads live settings)
"ssh_host": settings.ssh_host,
"ssh_port": settings.ssh_port,
"ssh_user": settings.ssh_user,
# Note: ssh_password and ssh_key intentionally omitted from display (sensitive)
# System settings (restart required to fully apply)
"truenas_base_url": settings.truenas_base_url,
"truenas_verify_tls": settings.truenas_verify_tls,
"poll_interval_seconds": settings.poll_interval_seconds,
"stale_threshold_seconds": settings.stale_threshold_seconds,
"allowed_ips": settings.allowed_ips,
"log_level": settings.log_level,
# Note: truenas_api_key intentionally omitted from display (sensitive)
}
from app import ssh_client as _ssh
ps = poller.get_state()
return templates.TemplateResponse("settings.html", {
"request": request,
"readonly": readonly,
"editable": editable,
"smtp_enabled": bool(settings.smtp_host),
"ssh_configured": _ssh.is_configured(),
"app_version": settings.app_version,
"poller": ps,
**_stale_context(ps),
})
@ -780,10 +1011,11 @@ async def settings_page(
@router.post("/api/v1/settings")
async def save_settings(body: dict):
"""Save editable runtime settings. Password is only updated if non-empty."""
# Don't overwrite password if client sent empty string
if "smtp_password" in body and body["smtp_password"] == "":
del body["smtp_password"]
"""Save editable runtime settings. Secrets are only updated if non-empty."""
# Don't overwrite secrets if client sent empty string
for secret_field in ("smtp_password", "truenas_api_key", "ssh_password", "ssh_key"):
if secret_field in body and body[secret_field] == "":
del body[secret_field]
try:
saved = settings_store.save(body)
@ -802,6 +1034,55 @@ async def test_smtp():
return {"ok": True}
@router.post("/api/v1/settings/test-ssh")
async def test_ssh():
"""Test the current SSH configuration."""
from app import ssh_client
result = await ssh_client.test_connection()
if not result["ok"]:
raise HTTPException(status_code=502, detail=result.get("error", "Connection failed"))
return {"ok": True}
@router.websocket("/ws/terminal")
async def terminal_ws(websocket: WebSocket):
"""WebSocket endpoint bridging the browser xterm.js terminal to an SSH PTY."""
from app import terminal as _term
await _term.handle(websocket)
@router.get("/api/v1/updates/check")
async def check_updates():
"""Check for a newer release on Forgejo."""
import httpx
current = settings.app_version
try:
async with httpx.AsyncClient(timeout=8.0) as client:
r = await client.get(
"https://git.hellocomputer.xyz/api/v1/repos/brandon/truenas-burnin/releases/latest",
headers={"Accept": "application/json"},
)
if r.status_code == 200:
data = r.json()
latest = data.get("tag_name", "").lstrip("v")
up_to_date = not latest or latest == current
return {
"current": current,
"latest": latest or None,
"update_available": not up_to_date,
"message": None,
}
elif r.status_code == 404:
return {"current": current, "latest": None, "update_available": False,
"message": "No releases published yet"}
else:
return {"current": current, "latest": None, "update_available": False,
"message": f"Forgejo API returned {r.status_code}"}
except Exception as exc:
return {"current": current, "latest": None, "update_available": False,
"message": f"Could not reach update server: {exc}"}
# ---------------------------------------------------------------------------
# Print view (must be BEFORE /{job_id} int route)
# ---------------------------------------------------------------------------

View file

@ -4,8 +4,8 @@ Runtime settings store — persists editable settings to /data/settings_override
Changes take effect immediately (in-memory setattr on the global Settings object)
and survive restarts (JSON file is loaded in main.py lifespan).
System settings (TrueNAS URL, poll interval, etc.) are saved to JSON but require
a container restart to fully take effect (clients/middleware are initialized at boot).
"""
import json
@ -18,6 +18,7 @@ log = logging.getLogger(__name__)
# Field name → coerce function. Only fields listed here are accepted by save().
_EDITABLE: dict[str, type] = {
# Email / SMTP
"smtp_host": str,
"smtp_ssl_mode": str,
"smtp_timeout": int,
@ -29,12 +30,32 @@ _EDITABLE: dict[str, type] = {
"smtp_report_hour": int,
"smtp_alert_on_fail": bool,
"smtp_alert_on_pass": bool,
# Webhook
"webhook_url": str,
# Burn-in behaviour
"stuck_job_hours": int,
"max_parallel_burnins": int,
"temp_warn_c": int,
"temp_crit_c": int,
"bad_block_threshold": int,
# SSH credentials — take effect immediately (each connection reads live settings)
"ssh_host": str,
"ssh_port": int,
"ssh_user": str,
"ssh_password": str,
"ssh_key": str,
# System settings — saved to JSON; require container restart to fully apply
"truenas_base_url": str,
"truenas_api_key": str,
"truenas_verify_tls": bool,
"poll_interval_seconds": int,
"stale_threshold_seconds": int,
"allowed_ips": str,
"log_level": str,
}
_VALID_SSL_MODES = {"starttls", "ssl", "plain"}
_VALID_LOG_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"}
def _overrides_path() -> Path:
@ -63,6 +84,21 @@ def _apply(data: dict) -> None:
if key == "smtp_report_hour" and not (0 <= int(val) <= 23):
log.warning("settings_store: smtp_report_hour out of range — ignoring")
continue
if key == "log_level" and val not in _VALID_LOG_LEVELS:
log.warning("settings_store: invalid log_level %r — ignoring", val)
continue
if key in ("poll_interval_seconds", "stale_threshold_seconds") and int(val) < 1:
log.warning("settings_store: %s must be >= 1 — ignoring", key)
continue
if key in ("temp_warn_c", "temp_crit_c") and not (20 <= int(val) <= 80):
log.warning("settings_store: %s out of range (2080) — ignoring", key)
continue
if key == "bad_block_threshold" and int(val) < 0:
log.warning("settings_store: bad_block_threshold must be >= 0 — ignoring")
continue
if key == "ssh_port" and not (1 <= int(val) <= 65535):
log.warning("settings_store: ssh_port out of range — ignoring")
continue
setattr(settings, key, val)
except (ValueError, TypeError) as exc:
log.warning("settings_store: invalid value for %s: %s", key, exc)

View file

@ -0,0 +1,386 @@
"""
SSH client for direct TrueNAS command execution (Stage 7).
When ssh_host is configured, burn-in stages use SSH to run smartctl and
badblocks directly on the TrueNAS host instead of going through the REST API.
Falls back to REST API / simulation when SSH is not configured (dev/mock mode).
TrueNAS CORE (FreeBSD) device paths: /dev/ada0, /dev/da0, etc.
TrueNAS SCALE (Linux) device paths: /dev/sda, /dev/sdb, etc.
The devname from the TrueNAS API is used as-is in /dev/{devname}.
"""
import asyncio
import logging
import re
from typing import Callable
log = logging.getLogger(__name__)
# ---------------------------------------------------------------------------
# Monitored SMART attributes
# True → any non-zero raw value is a hard failure (drive rejected)
# False → non-zero is a warning (flagged but test continues)
# ---------------------------------------------------------------------------
SMART_ATTRS: dict[int, tuple[str, bool]] = {
5: ("Reallocated_Sector_Ct", True), # reallocation = FAIL
10: ("Spin_Retry_Count", False), # mechanical stress = WARN
188: ("Command_Timeout", False), # drive not responding = WARN
197: ("Current_Pending_Sector", True), # pending reallocation = FAIL
198: ("Offline_Uncorrectable", True), # unrecoverable read error = FAIL
199: ("UDMA_CRC_Error_Count", False), # cable/controller issue = WARN
}
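The table above is the whole pass/warn/fail policy: a non-zero raw value on a critical attribute rejects the drive, while a non-zero on a non-critical one only flags it. A minimal, self-contained sketch of that classification (the `classify` helper is illustrative, not part of this module; the table excerpt mirrors the entries above):

```python
# Illustrative only: classify one SMART attribute reading against the
# policy table above (non-zero raw on a critical attribute = failure,
# non-zero on a non-critical attribute = warning, zero = ok).
SMART_ATTRS = {
    5: ("Reallocated_Sector_Ct", True),    # critical
    197: ("Current_Pending_Sector", True), # critical
    199: ("UDMA_CRC_Error_Count", False),  # warning only
}

def classify(attr_id: int, raw: int) -> str:
    if attr_id not in SMART_ATTRS or raw == 0:
        return "ok"
    _, is_critical = SMART_ATTRS[attr_id]
    return "fail" if is_critical else "warn"
```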
# ---------------------------------------------------------------------------
# Configuration check
# ---------------------------------------------------------------------------
def is_configured() -> bool:
"""Returns True when SSH host + at least one auth method is available."""
import os
from app.config import settings
if not settings.ssh_host:
return False
has_creds = bool(
settings.ssh_key
or settings.ssh_password
or os.path.exists(os.environ.get("SSH_KEY_FILE", _MOUNTED_KEY_PATH))
)
return has_creds
# ---------------------------------------------------------------------------
# Low-level connection
# ---------------------------------------------------------------------------
_MOUNTED_KEY_PATH = "/run/secrets/ssh_key"
async def _connect():
"""Open a single-use SSH connection. Caller must use `async with`."""
import asyncssh
from app.config import settings
kwargs: dict = {
"host": settings.ssh_host,
"port": settings.ssh_port,
"username": settings.ssh_user,
"known_hosts": None, # trust all hosts (same spirit as TRUENAS_VERIFY_TLS=false)
}
if settings.ssh_key:
# Key material provided via env var (base case)
kwargs["client_keys"] = [asyncssh.import_private_key(settings.ssh_key)]
elif settings.ssh_password:
kwargs["password"] = settings.ssh_password
else:
# Fall back to mounted key file (preferred for production — no key in env vars)
import os
key_path = os.environ.get("SSH_KEY_FILE", _MOUNTED_KEY_PATH)
if os.path.exists(key_path):
kwargs["client_keys"] = [key_path]
# If nothing is configured, asyncssh will attempt agent/default key lookup
return asyncssh.connect(**kwargs)
# ---------------------------------------------------------------------------
# Public API
# ---------------------------------------------------------------------------
async def test_connection() -> dict:
"""Test SSH connectivity. Returns {"ok": True} or {"ok": False, "error": str}."""
if not is_configured():
return {"ok": False, "error": "SSH not configured (ssh_host is empty)"}
try:
async with await _connect() as conn:
result = await conn.run("echo ok", check=False)
if "ok" in result.stdout:
return {"ok": True}
return {"ok": False, "error": result.stderr.strip() or "unexpected output"}
except Exception as exc:
return {"ok": False, "error": str(exc)}
async def get_smart_attributes(devname: str) -> dict:
"""
Run `smartctl -a /dev/{devname}` and parse the output.
Returns:
health: str "PASSED" | "FAILED" | "UNKNOWN"
raw_output: str full smartctl output
attributes: dict[int, {"name": str, "raw": int}]
warnings: list[str] attribute names with non-zero raw (non-critical)
failures: list[str] attribute names with non-zero raw (critical)
"""
cmd = f"smartctl -a /dev/{devname}"
try:
async with await _connect() as conn:
result = await conn.run(cmd, check=False)
output = result.stdout + result.stderr
return _parse_smartctl(output)
except Exception as exc:
return {
"health": "UNKNOWN",
"raw_output": str(exc),
"attributes": {},
"warnings": [],
"failures": [f"SSH error: {exc}"],
}
async def start_smart_test(devname: str, test_type: str) -> str:
"""
Run `smartctl -t short|long /dev/{devname}`.
Returns raw output. Raises RuntimeError on unrecoverable failure.
test_type: "SHORT" or "LONG"
"""
arg = "short" if test_type.upper() == "SHORT" else "long"
cmd = f"smartctl -t {arg} /dev/{devname}"
async with await _connect() as conn:
result = await conn.run(cmd, check=False)
output = result.stdout + result.stderr
# smartctl exits 0 or 4 when the test is successfully started on most drives
started = ("Testing has begun" in output or
"test has begun" in output.lower() or
result.returncode in (0, 4))
if not started:
raise RuntimeError(f"smartctl returned exit {result.returncode}: {output[:400]}")
return output
async def poll_smart_progress(devname: str) -> dict:
"""
Run `smartctl -a /dev/{devname}` and extract self-test status.
Returns:
state: "running" | "passed" | "failed" | "unknown"
percent_remaining: int | None (None until a "% of test remaining" line is parsed)
output: str
"""
cmd = f"smartctl -a /dev/{devname}"
async with await _connect() as conn:
result = await conn.run(cmd, check=False)
output = result.stdout + result.stderr
return _parse_smart_progress(output)
async def abort_smart_test(devname: str) -> None:
"""Send `smartctl -X /dev/{devname}` to abort an in-progress test."""
cmd = f"smartctl -X /dev/{devname}"
async with await _connect() as conn:
await conn.run(cmd, check=False)
async def run_badblocks(
devname: str,
on_progress: Callable[[int, int, str], None],
cancelled_fn: Callable[[], bool] | None = None,
) -> dict:
"""
Run `badblocks -wsv -b 4096 -p 1 /dev/{devname}` and stream output.
on_progress(percent, bad_blocks, line) is called for each line of output.
cancelled_fn() is polled to support mid-test cancellation.
Returns: {"bad_blocks": int, "output": str, "aborted": bool}
"""
from app.config import settings
cmd = f"badblocks -wsv -b 4096 -p 1 /dev/{devname}"
lines: list[str] = []
bad_blocks = 0
aborted = False
last_pct = 0
try:
async with await _connect() as conn:
async with conn.create_process(cmd) as proc:
# badblocks writes progress to stderr, bad block numbers to stdout
async def _read_stream(stream, is_stderr: bool):
nonlocal bad_blocks, last_pct, aborted
async for raw_line in stream:
line = raw_line if isinstance(raw_line, str) else raw_line.decode("utf-8", errors="replace")
lines.append(line)
if is_stderr:
m = re.search(r"([\d.]+)%\s+done", line)
if m:
last_pct = min(99, int(float(m.group(1))))
else:
# Each non-empty stdout line during badblocks is a bad block number
stripped = line.strip()
if stripped and stripped.isdigit():
bad_blocks += 1
on_progress(last_pct, bad_blocks, line)
# Abort if threshold exceeded
if bad_blocks > settings.bad_block_threshold:
aborted = True
proc.kill()
lines.append(
f"\n[ABORTED] Bad block count ({bad_blocks}) exceeded "
f"threshold ({settings.bad_block_threshold})\n"
)
return
# Abort on cancellation
if cancelled_fn and cancelled_fn():
aborted = True
proc.kill()
return
stdout_task = asyncio.create_task(_read_stream(proc.stdout, False))
stderr_task = asyncio.create_task(_read_stream(proc.stderr, True))
await asyncio.gather(stdout_task, stderr_task, return_exceptions=True)
await proc.wait()
except Exception as exc:
lines.append(f"\n[SSH error] {exc}\n")
if not aborted:
last_pct = 100
return {
"bad_blocks": bad_blocks,
"output": "".join(lines),
"aborted": aborted,
}
async def get_system_sensors() -> dict:
"""
Run `sensors -j` on TrueNAS and extract system-level temperatures.
Returns {"cpu_c": int|None, "pch_c": int|None}.
cpu_c = CPU package temp (coretemp chip)
pch_c = PCH/chipset temp (pch_* chip) proxy for storage I/O lane thermals
Falls back gracefully if SSH is not configured or lm-sensors is unavailable.
"""
if not is_configured():
return {}
try:
async with await _connect() as conn:
result = await conn.run("sensors -j 2>/dev/null", check=False)
output = result.stdout.strip()
if not output:
return {}
return _parse_sensors_json(output)
except Exception as exc:
log.debug("get_system_sensors failed: %s", exc)
return {}
def _parse_sensors_json(output: str) -> dict:
import json as _json
try:
data = _json.loads(output)
except Exception:
return {}
cpu_c: int | None = None
pch_c: int | None = None
for chip_name, chip_data in data.items():
if not isinstance(chip_data, dict):
continue
# CPU package temp — coretemp chip, "Package id N" sensor
if chip_name.startswith("coretemp") and cpu_c is None:
for sensor_name, sensor_vals in chip_data.items():
if not isinstance(sensor_vals, dict):
continue
if "package" in sensor_name.lower():
for k, v in sensor_vals.items():
if k.endswith("_input") and isinstance(v, (int, float)):
cpu_c = int(round(v))
break
if cpu_c is not None:
break
# PCH / chipset temp — manages PCIe lanes including HBA / storage I/O
elif chip_name.startswith("pch_") and pch_c is None:
for sensor_name, sensor_vals in chip_data.items():
if not isinstance(sensor_vals, dict):
continue
for k, v in sensor_vals.items():
if k.endswith("_input") and isinstance(v, (int, float)):
pch_c = int(round(v))
break
if pch_c is not None:
break
return {"cpu_c": cpu_c, "pch_c": pch_c}
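The walk above takes the first `*_input` reading from a `coretemp` chip's "Package id" sensor and the first `*_input` from any `pch_*` chip. A condensed, self-contained version of the same walk against a minimal `sensors -j` payload (chip and sensor names below are typical lm-sensors shapes, chosen for illustration):

```python
import json

# Condensed version of _parse_sensors_json: first coretemp "Package id"
# *_input wins for cpu_c, first pch_* *_input wins for pch_c.
def parse_sensors(output: str) -> dict:
    data = json.loads(output)
    cpu_c = pch_c = None
    for chip, vals in data.items():
        if not isinstance(vals, dict):
            continue
        for name, sensor in vals.items():
            if not isinstance(sensor, dict):
                continue
            for k, v in sensor.items():
                if not k.endswith("_input") or not isinstance(v, (int, float)):
                    continue
                if chip.startswith("coretemp") and "package" in name.lower() and cpu_c is None:
                    cpu_c = int(round(v))
                elif chip.startswith("pch_") and pch_c is None:
                    pch_c = int(round(v))
    return {"cpu_c": cpu_c, "pch_c": pch_c}

sample = '''{
  "coretemp-isa-0000": {
    "Package id 0": {"temp1_input": 47.0},
    "Core 0": {"temp2_input": 44.0}
  },
  "pch_cannonlake-virtual-0": {
    "temp1": {"temp1_input": 52.4}
  }
}'''
```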
# ---------------------------------------------------------------------------
# Parsers
# ---------------------------------------------------------------------------
def _parse_smartctl(output: str) -> dict:
health = "UNKNOWN"
attributes: dict[int, dict] = {}
warnings: list[str] = []
failures: list[str] = []
m = re.search(r"self-assessment test result:\s+(\w+)", output, re.IGNORECASE)
if m:
health = m.group(1).upper()
# Attribute table: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
for line in output.splitlines():
am = re.match(
r"\s*(\d+)\s+(\S+)\s+\S+\s+\d+\s+\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+(\d+)",
line,
)
if not am:
continue
attr_id = int(am.group(1))
attr_name = am.group(2)
raw_val = int(am.group(3))
attributes[attr_id] = {"name": attr_name, "raw": raw_val}
if attr_id in SMART_ATTRS:
_, is_critical = SMART_ATTRS[attr_id]
if raw_val > 0:
msg = f"{attr_name} = {raw_val}"
if is_critical:
failures.append(msg)
else:
warnings.append(msg)
return {
"health": health,
"raw_output": output,
"attributes": attributes,
"warnings": warnings,
"failures": failures,
}
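The attribute regex above expects the standard ten-column `smartctl -A` table (ID, name, flag, value, worst, thresh, type, updated, when-failed, raw). A spot check against one representative row (the row text is a typical example, not captured from a real drive):

```python
import re

# Same pattern as the attribute-table loop above.
ATTR_RE = re.compile(
    r"\s*(\d+)\s+(\S+)\s+\S+\s+\d+\s+\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+(\d+)"
)
line = "  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0"
m = ATTR_RE.match(line)
```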
def _parse_smart_progress(output: str) -> dict:
state = "unknown"
percent_remaining = None # None = "in progress but no % line parsed yet"
lower = output.lower()
if "self-test routine in progress" in lower:
state = "running"
m = re.search(r"(\d+)%\s+of\s+test\s+remaining", output, re.IGNORECASE)
if m:
percent_remaining = int(m.group(1))
elif "completed without error" in lower:
state = "passed"
elif (
"completed: read failure" in lower
or "completed: write failure" in lower
or "aborted by host" in lower
or ("completed" in lower and "failure" in lower)
):
state = "failed"
elif "in progress" in lower:
state = "running"
return {
"state": state,
"percent_remaining": percent_remaining,
"output": output,
}


@ -755,6 +755,11 @@ tr:hover td {
flex-direction: column;
gap: 8px;
pointer-events: none;
transition: bottom 0.25s ease;
}
body.drawer-open #toast-container {
bottom: calc(45vh + 16px);
}
.toast {
@ -1071,6 +1076,56 @@ a.stat-card:hover {
.stat-passed .stat-value { color: var(--green); }
.stat-idle .stat-value { color: var(--text-muted); }
/* Vertical separator between drive-count cards and sensor chips */
.stats-bar-sep {
width: 1px;
height: 36px;
background: var(--border);
align-self: center;
flex-shrink: 0;
}
/* Compact sensor chip — CPU / PCH / Thermal */
.stat-sensor {
background: var(--bg-card);
border: 1px solid var(--border);
border-radius: 8px;
padding: 6px 12px;
text-align: center;
min-width: 52px;
display: flex;
flex-direction: column;
gap: 2px;
}
.stat-sensor-val {
font-size: 16px;
font-weight: 700;
font-variant-numeric: tabular-nums;
line-height: 1.1;
}
.stat-sensor-label {
font-size: 9px;
text-transform: uppercase;
letter-spacing: 0.08em;
color: var(--text-muted);
line-height: 1.2;
}
/* Thermal pressure states */
.stat-sensor-thermal-warn {
border-color: var(--yellow-bd);
background: var(--yellow-bg);
}
.stat-sensor-thermal-warn .stat-sensor-val { color: var(--yellow); }
.stat-sensor-thermal-crit {
border-color: var(--red-bd);
background: var(--red-bg);
}
.stat-sensor-thermal-crit .stat-sensor-val { color: var(--red); }
/* -----------------------------------------------------------------------
Batch action bar (inside filter-bar)
----------------------------------------------------------------------- */
@ -1937,3 +1992,508 @@ a.header-brand:hover .header-title {
outline: 2px solid var(--blue);
outline-offset: 2px;
}
/* -----------------------------------------------------------------------
Log Drawer
----------------------------------------------------------------------- */
.log-drawer {
position: fixed;
bottom: 0;
left: 0;
right: 0;
height: 45vh;
min-height: 260px;
background: var(--bg-card);
border-top: 2px solid var(--border);
z-index: 150;
display: flex;
flex-direction: column;
box-shadow: 0 -6px 32px rgba(0,0,0,0.5);
animation: drawer-slide-up 0.18s ease;
}
.log-drawer[hidden] { display: none; }
@keyframes drawer-slide-up {
from { transform: translateY(100%); opacity: 0; }
to { transform: translateY(0); opacity: 1; }
}
/* Shrink table when drawer is open */
body.drawer-open .table-wrap {
max-height: calc(100vh - 205px - 45vh);
}
/* Drawer header */
.drawer-header {
display: flex;
align-items: center;
gap: 14px;
padding: 7px 16px;
border-bottom: 1px solid var(--border);
flex-shrink: 0;
background: var(--bg);
}
.drawer-drive-info {
display: flex;
flex-direction: column;
gap: 1px;
min-width: 80px;
}
.drawer-devname {
font-size: 13px;
font-weight: 600;
color: var(--text-strong);
font-family: "SF Mono", "Cascadia Code", monospace;
}
.drawer-drive-meta {
font-size: 11px;
color: var(--text-muted);
white-space: nowrap;
overflow: hidden;
text-overflow: ellipsis;
max-width: 240px;
}
/* Tabs */
.drawer-tabs {
display: flex;
gap: 2px;
}
.drawer-tab {
background: none;
border: 1px solid transparent;
border-radius: 5px;
color: var(--text-muted);
cursor: pointer;
font-size: 12px;
font-family: inherit;
font-weight: 500;
padding: 4px 12px;
transition: color 0.12s, background 0.12s;
}
.drawer-tab:hover {
color: var(--text);
background: var(--bg-card);
}
.drawer-tab.active {
color: var(--text-strong);
background: var(--bg-card);
border-color: var(--border);
}
/* Controls */
.drawer-controls {
display: flex;
align-items: center;
gap: 12px;
margin-left: auto;
flex-shrink: 0;
}
.autoscroll-label {
display: flex;
align-items: center;
gap: 5px;
font-size: 11px;
color: var(--text-muted);
cursor: pointer;
user-select: none;
}
.autoscroll-label input { accent-color: var(--blue); cursor: pointer; }
.drawer-close {
background: none;
border: 1px solid var(--border);
border-radius: 4px;
color: var(--text-muted);
cursor: pointer;
font-size: 12px;
width: 24px;
height: 24px;
display: flex;
align-items: center;
justify-content: center;
padding: 0;
transition: color 0.12s, border-color 0.12s;
}
.drawer-close:hover { color: var(--text); border-color: var(--text-muted); }
/* Body + panels */
.drawer-body {
flex: 1;
overflow: hidden;
position: relative;
}
.drawer-panel {
display: none;
height: 100%;
overflow-y: auto;
padding: 12px 16px 20px;
}
.drawer-panel.active { display: block; }
.drawer-loading,
.drawer-empty {
color: var(--text-muted);
font-size: 13px;
padding: 28px 0;
text-align: center;
}
/* Clickable rows */
#drives-tbody tr[id^="drive-"] { cursor: pointer; }
/* Active row highlight */
tr.drawer-row-active {
background: rgba(88, 166, 255, 0.07) !important;
outline: 1px solid var(--blue-bd);
outline-offset: -1px;
}
/* ---- Burn-In tab ---- */
.drawer-job-header {
display: flex;
align-items: center;
gap: 10px;
margin-bottom: 12px;
}
.drawer-job-meta {
font-size: 12px;
color: var(--text-muted);
}
.drawer-stages {
display: flex;
flex-direction: column;
gap: 6px;
}
.drawer-stage {
border: 1px solid var(--border);
border-radius: 6px;
overflow: hidden;
}
.stage-row-header {
display: flex;
align-items: center;
gap: 8px;
padding: 8px 12px;
font-size: 13px;
}
.stage-running .stage-row-header { background: var(--blue-bg); }
.stage-passed .stage-row-header { background: var(--green-bg); }
.stage-failed .stage-row-header { background: var(--red-bg); }
.stage-icon {
font-size: 12px;
width: 16px;
text-align: center;
flex-shrink: 0;
}
.stage-running .stage-icon { color: var(--blue); }
.stage-passed .stage-icon { color: var(--green); }
.stage-failed .stage-icon { color: var(--red); }
.stage-cancelled .stage-icon,
.stage-pending .stage-icon { color: var(--gray); }
.stage-name-label {
font-size: 13px;
font-weight: 500;
color: var(--text);
flex: 1;
}
.stage-pct {
font-size: 12px;
color: var(--blue);
font-weight: 600;
font-variant-numeric: tabular-nums;
}
.stage-duration {
font-size: 11px;
color: var(--text-muted);
font-variant-numeric: tabular-nums;
}
.stage-cursor {
color: var(--blue);
font-size: 14px;
animation: blink 1s step-end infinite;
}
@keyframes blink {
0%, 100% { opacity: 1; }
50% { opacity: 0; }
}
.stage-error-line {
padding: 7px 12px;
font-size: 12px;
color: var(--red);
font-family: "SF Mono", "Cascadia Code", monospace;
background: var(--red-bg);
border-top: 1px solid var(--red-bd);
white-space: pre-wrap;
word-break: break-word;
}
/* ---- SMART tab ---- */
.drawer-smart-grid {
display: grid;
grid-template-columns: 1fr 1fr;
gap: 12px;
}
.smart-card {
background: var(--bg);
border: 1px solid var(--border);
border-radius: 8px;
padding: 12px 14px;
display: flex;
flex-direction: column;
gap: 8px;
}
.smart-card-label {
font-size: 11px;
font-weight: 600;
text-transform: uppercase;
letter-spacing: 0.06em;
color: var(--text-muted);
}
.smart-progress {
display: flex;
align-items: center;
gap: 8px;
}
.smart-progress .progress-bar { flex: 1; }
.smart-detail {
font-size: 12px;
color: var(--text-muted);
}
/* ---- Events tab ---- */
.drawer-events {
display: flex;
flex-direction: column;
}
.drawer-event {
display: flex;
align-items: baseline;
gap: 10px;
padding: 7px 0;
border-bottom: 1px solid var(--border);
font-size: 12px;
}
.drawer-event:last-child { border-bottom: none; }
.event-time {
color: var(--text-muted);
font-size: 11px;
white-space: nowrap;
flex-shrink: 0;
font-variant-numeric: tabular-nums;
}
.event-type {
color: var(--blue);
font-weight: 500;
white-space: nowrap;
flex-shrink: 0;
}
.event-message {
color: var(--text);
flex: 1;
}
.event-operator {
color: var(--text-muted);
font-size: 11px;
white-space: nowrap;
flex-shrink: 0;
}
.drawer-event.event-error .event-type { color: var(--red); }
.drawer-event.event-error .event-message { color: var(--red); }
@media (max-width: 600px) {
.drawer-smart-grid { grid-template-columns: 1fr; }
.drawer-drive-meta { display: none; }
}
/* -----------------------------------------------------------------------
Stage raw log output (SSH mode)
----------------------------------------------------------------------- */
.stage-log {
font-family: "SF Mono", "Consolas", "Monaco", monospace;
font-size: 11px;
line-height: 1.5;
color: var(--text-muted);
background: var(--bg);
border-left: 2px solid var(--border);
margin: 6px 0 2px 28px;
padding: 6px 10px;
white-space: pre-wrap;
word-break: break-all;
max-height: 200px;
overflow-y: auto;
}
.stage-log .log-bad-block {
color: var(--red);
font-weight: 600;
}
.stage-log .log-warn {
color: var(--yellow);
}
/* -----------------------------------------------------------------------
SMART attributes table in drawer
----------------------------------------------------------------------- */
.smart-attrs {
margin-top: 12px;
border-top: 1px solid var(--border);
padding-top: 10px;
}
.smart-attrs-title {
font-size: 11px;
font-weight: 600;
color: var(--text-muted);
text-transform: uppercase;
letter-spacing: .05em;
margin-bottom: 6px;
}
.smart-attr-row {
display: flex;
justify-content: space-between;
align-items: center;
padding: 3px 0;
font-size: 12px;
border-bottom: 1px solid color-mix(in srgb, var(--border) 50%, transparent);
}
.smart-attr-row:last-child { border-bottom: none; }
.smart-attr-name { color: var(--text-muted); }
.smart-attr-val { font-family: "SF Mono", monospace; font-size: 12px; }
.smart-attr-val.attr-ok { color: var(--green); }
.smart-attr-val.attr-warn { color: var(--yellow); font-weight: 600; }
.smart-attr-val.attr-fail { color: var(--red); font-weight: 600; }
.smart-attr-raw-output {
font-family: "SF Mono", "Consolas", monospace;
font-size: 10.5px;
line-height: 1.45;
color: var(--text-muted);
background: var(--bg);
border: 1px solid var(--border);
border-radius: 4px;
padding: 8px 10px;
margin-top: 10px;
white-space: pre;
overflow: auto;
max-height: 240px;
}
/* -----------------------------------------------------------------------
Reset button
----------------------------------------------------------------------- */
.btn-reset {
background: transparent;
border: 1px solid color-mix(in srgb, var(--text-muted) 40%, transparent);
color: var(--text-muted);
border-radius: 5px;
padding: 3px 8px;
font-size: 12px;
cursor: pointer;
transition: border-color .15s, color .15s;
}
.btn-reset:hover {
border-color: var(--yellow);
color: var(--yellow);
}
/* -----------------------------------------------------------------------
Parallel burn-in inline warning
----------------------------------------------------------------------- */
.sf-inline-warn {
background: color-mix(in srgb, var(--yellow) 12%, transparent);
border: 1px solid color-mix(in srgb, var(--yellow) 40%, transparent);
border-radius: 5px;
color: var(--yellow);
font-size: 12px;
padding: 7px 10px;
margin: 4px 0 8px 0;
}
/* -----------------------------------------------------------------------
SSH textarea
----------------------------------------------------------------------- */
.sf-textarea {
resize: vertical;
min-height: 90px;
font-family: "SF Mono", "Consolas", monospace;
font-size: 11px;
}
/* -----------------------------------------------------------------------
Version badge in header
----------------------------------------------------------------------- */
.header-version {
font-size: 10px;
color: var(--text-muted);
opacity: .55;
font-weight: 400;
letter-spacing: 0;
align-self: flex-end;
padding-bottom: 1px;
font-variant-numeric: tabular-nums;
}
/* -----------------------------------------------------------------------
Live Terminal drawer panel (xterm.js)
----------------------------------------------------------------------- */
.drawer-panel-terminal {
padding: 0 !important;
overflow: hidden !important;
position: relative;
background: #0d1117;
}
/* Let xterm fill the full panel height */
.drawer-panel-terminal .xterm {
height: 100%;
}
.drawer-panel-terminal .xterm-viewport {
overflow-y: auto !important;
}
/* Reconnect bar — floats over the terminal when disconnected */
.term-reconnect-bar {
position: absolute;
bottom: 12px;
right: 12px;
z-index: 20;
display: flex;
align-items: center;
gap: 8px;
background: rgba(13,17,23,0.85);
border: 1px solid var(--border);
border-radius: 6px;
padding: 6px 10px;
font-size: 12px;
color: var(--text-muted);
}
.term-reconnect-bar .btn-secondary {
padding: 3px 10px;
font-size: 11px;
}


@ -69,6 +69,10 @@
restoreCheckboxes();
initElapsedTimers();
initLocationEdits();
if (_drawerDriveId) {
_drawerHighlightRow(_drawerDriveId);
drawerFetch(_drawerDriveId);
}
});
updateCounts();
@ -131,14 +135,59 @@
if (nb) nb.style.display = 'none';
}
// Handle SSE events
document.addEventListener('htmx:sseMessage', function (e) {
if (!e.detail) return;
if (e.detail.type === 'job-alert') {
try { handleJobAlert(JSON.parse(e.detail.data)); } catch (_) {}
} else if (e.detail.type === 'system-sensors') {
try { handleSystemSensors(JSON.parse(e.detail.data)); } catch (_) {}
}
});
function handleSystemSensors(data) {
var st = data.system_temps || {};
var tp = data.thermal_pressure || 'ok';
var warn = data.temp_warn_c || 46;
var crit = data.temp_crit_c || 55;
function tempClass(c) {
if (c == null) return '';
return c >= crit ? 'temp-hot' : c >= warn ? 'temp-warm' : 'temp-cool';
}
// CPU chip
var cpuChip = document.getElementById('sensor-cpu');
var cpuVal = document.getElementById('sensor-cpu-val');
if (cpuVal && st.cpu_c != null) {
if (cpuChip) cpuChip.hidden = false;
cpuVal.textContent = st.cpu_c + '°';
cpuVal.className = 'stat-sensor-val ' + tempClass(st.cpu_c);
}
// PCH chip
var pchChip = document.getElementById('sensor-pch');
var pchVal = document.getElementById('sensor-pch-val');
if (pchVal && st.pch_c != null) {
if (pchChip) pchChip.hidden = false;
pchVal.textContent = st.pch_c + '°';
pchVal.className = 'stat-sensor-val ' + tempClass(st.pch_c);
}
// Thermal pressure chip
var tChip = document.getElementById('sensor-thermal');
var tVal = document.getElementById('sensor-thermal-val');
if (tChip && tVal) {
if (tp === 'warn' || tp === 'crit') {
tChip.hidden = false;
tChip.className = 'stat-sensor stat-sensor-thermal stat-sensor-thermal-' + tp;
tVal.textContent = tp === 'warn' ? 'WARM' : 'HOT';
} else {
tChip.hidden = true;
}
}
}
function handleJobAlert(data) {
var isPass = data.state === 'passed';
var icon = isPass ? '✓' : '✕';
@ -842,7 +891,458 @@
if (modal && !modal.hidden) { closeModal(); return; }
var bModal = document.getElementById('batch-modal');
if (bModal && !bModal.hidden) { closeBatchModal(); return; }
if (_drawerDriveId) { closeDrawer(); return; }
}
});
// -----------------------------------------------------------------------
// Log Drawer
// -----------------------------------------------------------------------
var _drawerDriveId = null;
var _drawerTab = 'burnin';
function openDrawer(driveId) {
if (_drawerDriveId === driveId) { closeDrawer(); return; }
_drawerDriveId = driveId;
var drawer = document.getElementById('log-drawer');
drawer.removeAttribute('hidden');
document.body.classList.add('drawer-open');
_drawerHighlightRow(driveId);
drawerFetch(driveId);
}
function closeDrawer() {
_drawerDriveId = null;
var drawer = document.getElementById('log-drawer');
drawer.setAttribute('hidden', '');
document.body.classList.remove('drawer-open');
document.querySelectorAll('tr.drawer-row-active').forEach(function (r) {
r.classList.remove('drawer-row-active');
});
}
function _drawerHighlightRow(driveId) {
document.querySelectorAll('tr.drawer-row-active').forEach(function (r) {
r.classList.remove('drawer-row-active');
});
var row = document.getElementById('drive-' + driveId);
if (row) row.classList.add('drawer-row-active');
}
async function drawerFetch(driveId) {
['burnin', 'smart', 'events'].forEach(function (tab) {
var p = document.getElementById('drawer-panel-' + tab);
if (p && !p.innerHTML.trim()) {
p.innerHTML = '<div class="drawer-loading">Loading\u2026</div>';
}
});
try {
var resp = await fetch('/api/v1/drives/' + driveId + '/drawer');
if (!resp.ok) throw new Error('HTTP ' + resp.status);
var data = await resp.json();
_drawerRender(data);
} catch (e) {
['burnin', 'smart', 'events'].forEach(function (tab) {
var p = document.getElementById('drawer-panel-' + tab);
if (p) p.innerHTML = '<div class="drawer-loading" style="color:var(--red)">Failed to load.</div>';
});
}
}
function _drawerRender(data) {
var drive = data.drive || {};
var devnameEl = document.getElementById('drawer-devname');
var metaEl = document.getElementById('drawer-drive-meta');
if (devnameEl) devnameEl.textContent = drive.devname || '\u2014';
if (metaEl) {
var meta = drive.model || '';
if (drive.serial) meta += ' \u00b7 ' + drive.serial;
metaEl.textContent = meta;
}
_drawerRenderBurnin(data.burnin);
_drawerRenderSmart(data.smart);
_drawerRenderEvents(data.events);
}
function _drawerRenderBurnin(burnin) {
var panel = document.getElementById('drawer-panel-burnin');
if (!panel) return;
if (!burnin) {
panel.innerHTML = '<div class="drawer-empty">No burn-in history for this drive.</div>';
return;
}
var html = '<div class="drawer-job-header">';
html += '<span class="chip chip-' + _esc(burnin.state) + '">' + _esc(burnin.state.toUpperCase()) + '</span>';
html += '<span class="drawer-job-meta">';
if (burnin.operator) html += 'by ' + _esc(burnin.operator);
if (burnin.started_at) html += ' \u00b7 ' + _drawerFmtDt(burnin.started_at);
html += '</span></div>';
html += '<div class="drawer-stages">';
var stages = burnin.stages || [];
if (stages.length) {
stages.forEach(function (s) {
html += '<div class="drawer-stage stage-' + _esc(s.state) + '">';
html += '<div class="stage-row-header">';
html += '<span class="stage-icon">' + _drawerStageIcon(s.state) + '</span>';
html += '<span class="stage-name-label">' + _esc(_drawerStageName(s.stage_name)) + '</span>';
if (s.state === 'running') {
html += '<span class="stage-pct">' + (s.percent || 0) + '%</span>';
if (s.started_at) {
html += '<span class="elapsed-timer" data-started="' + _esc(s.started_at) + '"></span>';
}
html += '<span class="stage-cursor">\u258a</span>';
} else if (s.finished_at && s.started_at) {
html += '<span class="stage-duration">' + _drawerFmtDuration(s.started_at, s.finished_at) + '</span>';
}
html += '</div>';
if (s.error_text) {
html += '<div class="stage-error-line">' + _esc(s.error_text) + '</div>';
}
// Raw SSH log output (if available)
if (s.log_text) {
var logHtml = _esc(s.log_text)
.replace(/^(\d+)\s*$/gm, '<span class="log-bad-block">$1 ← BAD BLOCK</span>')
.replace(/\[WARNING\][^\n]*/g, '<span class="log-warn">$&</span>');
html += '<pre class="stage-log">' + logHtml + '</pre>';
}
// Bad block count badge
if (s.bad_blocks && s.bad_blocks > 0) {
html += '<div class="stage-error-line">' + s.bad_blocks + ' bad block(s) found</div>';
}
html += '</div>';
});
} else {
html += '<div class="drawer-empty">No stage data yet.</div>';
}
html += '</div>';
var wasAtBottom = panel.scrollHeight - panel.scrollTop <= panel.clientHeight + 5;
panel.innerHTML = html;
tickElapsedTimers();
var autoScroll = document.getElementById('autoscroll-toggle');
if (autoScroll && autoScroll.checked && wasAtBottom) {
panel.scrollTop = panel.scrollHeight;
}
}
// Monitored SMART attributes for inline colouring
var _SMART_CRITICAL = {5: true, 197: true, 198: true};
var _SMART_WARN = {10: true, 188: true, 199: true};
function _drawerRenderSmart(smart) {
var panel = document.getElementById('drawer-panel-smart');
if (!panel) return;
var html = '<div class="drawer-smart-grid">';
['short', 'long'].forEach(function (type) {
var t = smart ? smart[type] : null;
var label = type === 'short' ? 'Short SMART' : 'Long SMART';
html += '<div class="smart-card">';
html += '<div class="smart-card-label">' + label + '</div>';
if (!t || !t.state || t.state === 'idle') {
html += '<span class="chip chip-unknown">Not run</span>';
} else {
html += '<span class="chip chip-' + _esc(t.state) + '">' + _esc(t.state.toUpperCase()) + '</span>';
if (t.state === 'running') {
html += '<div class="smart-progress"><div class="progress-bar"><div class="progress-fill" style="width:' + (t.percent || 0) + '%"></div></div>'
+ '<span style="font-size:12px;color:var(--blue)">' + (t.percent || 0) + '%</span></div>';
}
if (t.started_at) html += '<div class="smart-detail">Started: ' + _drawerFmtDt(t.started_at) + '</div>';
if (t.finished_at) html += '<div class="smart-detail">Finished: ' + _drawerFmtDt(t.finished_at) + '</div>';
if (t.error_text) html += '<div class="stage-error-line">' + _esc(t.error_text) + '</div>';
// Raw smartctl output (SSH mode)
if (t.raw_output) {
html += '<pre class="smart-attr-raw-output">' + _esc(t.raw_output) + '</pre>';
}
}
html += '</div>';
});
html += '</div>';
// SMART attribute table (from SSH attribute parse)
var attrs = smart && smart.attrs;
if (attrs) {
html += '<div class="smart-attrs">';
html += '<div class="smart-attrs-title">SMART Attributes</div>';
if (attrs.failures && attrs.failures.length) {
html += '<div class="stage-error-line" style="margin-bottom:6px">✕ Failures: ' + _esc(attrs.failures.join('; ')) + '</div>';
}
if (attrs.warnings && attrs.warnings.length) {
html += '<div class="stage-error-line" style="color:var(--yellow);margin-bottom:6px">⚠ Warnings: ' + _esc(attrs.warnings.join('; ')) + '</div>';
}
var attrMap = attrs.attrs || {};
var monitoredIds = [5, 10, 188, 197, 198, 199];
monitoredIds.forEach(function (id) {
var entry = attrMap[String(id)];
if (!entry) return;
var raw = entry.raw;
var cls = raw > 0 ? (_SMART_CRITICAL[id] ? 'attr-fail' : 'attr-warn') : 'attr-ok';
html += '<div class="smart-attr-row">';
html += '<span class="smart-attr-name">' + id + ' ' + _esc(entry.name) + '</span>';
html += '<span class="smart-attr-val ' + cls + '">' + raw + '</span>';
html += '</div>';
});
html += '</div>';
}
panel.innerHTML = html;
}
function _drawerRenderEvents(events) {
var panel = document.getElementById('drawer-panel-events');
if (!panel) return;
if (!events || events.length === 0) {
panel.innerHTML = '<div class="drawer-empty">No events recorded for this drive.</div>';
return;
}
var html = '<div class="drawer-events">';
events.forEach(function (ev) {
var isErr = (ev.event_type || '').indexOf('fail') !== -1 || (ev.event_type || '').indexOf('stuck') !== -1;
html += '<div class="drawer-event' + (isErr ? ' event-error' : '') + '">';
html += '<span class="event-time">' + _drawerFmtDt(ev.created_at) + '</span>';
html += '<span class="event-type">' + _esc(ev.event_type || '') + '</span>';
if (ev.message) html += '<span class="event-message">' + _esc(ev.message) + '</span>';
if (ev.operator) html += '<span class="event-operator">by ' + _esc(ev.operator) + '</span>';
html += '</div>';
});
html += '</div>';
panel.innerHTML = html;
}
function _esc(s) {
return String(s == null ? '' : s)
.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;').replace(/"/g, '&quot;');
}
function _drawerFmtDt(iso) {
if (!iso) return '';
try { return new Date(iso).toLocaleString(); } catch (e) { return iso; }
}
function _drawerFmtDuration(startIso, endIso) {
try {
var secs = Math.max(0, Math.floor((new Date(endIso) - new Date(startIso)) / 1000));
var h = Math.floor(secs / 3600), m = Math.floor((secs % 3600) / 60), s = secs % 60;
if (h > 0) return h + 'h ' + m + 'm';
if (m > 0) return m + 'm ' + s + 's';
return s + 's';
} catch (e) { return ''; }
}
function _drawerStageName(name) {
return (name || '').replace(/_/g, ' ').replace(/\b\w/g, function (c) { return c.toUpperCase(); });
}
function _drawerStageIcon(state) {
return { passed: '\u2713', failed: '\u2715', running: '\u25b6', cancelled: '\u25fc', pending: '\u25cb', skipped: '\u2014' }[state] || '\u25cb';
}
// Row click → open drawer (ignore interactive elements)
document.addEventListener('click', function (e) {
if (e.target.closest('button, input, label, a, .drive-location')) return;
var row = e.target.closest('#drives-tbody tr[id^="drive-"]');
if (!row) return;
openDrawer(row.id.replace('drive-', ''));
});
// Tab switching
document.addEventListener('click', function (e) {
var btn = e.target.closest('.drawer-tab');
if (!btn) return;
_drawerTab = btn.dataset.tab;
document.querySelectorAll('.drawer-tab').forEach(function (b) {
b.classList.toggle('active', b.dataset.tab === _drawerTab);
});
document.querySelectorAll('.drawer-panel').forEach(function (p) {
p.classList.toggle('active', p.id === 'drawer-panel-' + _drawerTab);
});
// Terminal tab: init/fit on activation; hide autoscroll (N/A for terminal)
var asl = document.querySelector('.autoscroll-label');
if (_drawerTab === 'terminal') {
if (asl) asl.style.visibility = 'hidden';
openTerminalTab();
} else {
if (asl) asl.style.visibility = '';
}
});
// Close button
document.addEventListener('click', function (e) {
if (e.target.closest('#drawer-close-btn')) closeDrawer();
});
// Reset button — clears SMART state for a drive
document.addEventListener('click', function (e) {
var btn = e.target.closest('.btn-reset');
if (!btn) return;
var driveId = btn.dataset.driveId;
if (!driveId) return;
var operator = (window._operator || 'operator');
fetch('/api/v1/drives/' + driveId + '/reset', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ operator: operator }),
}).then(function (r) {
if (!r.ok) return r.json().then(function (d) { showToast(d.detail || 'Reset failed', 'error'); });
showToast('Drive reset — state cleared', 'success');
}).catch(function () { showToast('Network error', 'error'); });
});
// -----------------------------------------------------------------------
// Live Terminal (xterm.js + SSH WebSocket)
// -----------------------------------------------------------------------
var _xtermReady = false; // xterm.js + FitAddon libraries loaded
var _terminal = null; // xterm.js Terminal instance
var _termFit = null; // FitAddon instance
var _termWs = null; // active WebSocket (null = disconnected)
function _loadXtermLibs(cb) {
var link = document.createElement('link');
link.rel = 'stylesheet';
link.href = 'https://cdn.jsdelivr.net/npm/xterm@5.3.0/css/xterm.css';
document.head.appendChild(link);
var s1 = document.createElement('script');
s1.src = 'https://cdn.jsdelivr.net/npm/xterm@5.3.0/lib/xterm.js';
s1.onload = function () {
var s2 = document.createElement('script');
s2.src = 'https://cdn.jsdelivr.net/npm/xterm-addon-fit@0.8.0/lib/xterm-addon-fit.js';
s2.onload = cb;
document.head.appendChild(s2);
};
document.head.appendChild(s1);
}
function openTerminalTab() {
var panel = document.getElementById('drawer-panel-terminal');
if (!panel) return;
if (!_xtermReady) {
panel.innerHTML = '<div class="drawer-loading">Loading terminal\u2026</div>';
_loadXtermLibs(function () {
_xtermReady = true;
_termInit(panel);
});
return;
}
if (!_terminal) {
_termInit(panel);
return;
}
// Already initialised — refit to current panel dimensions
setTimeout(function () {
if (_termFit) try { _termFit.fit(); } catch (_) {}
}, 30);
}
function _termInit(panel) {
panel.innerHTML = '';
var term = new Terminal({
cursorBlink: true,
fontSize: 13,
fontFamily: '"SF Mono","Fira Code",Consolas,"DejaVu Sans Mono",monospace',
theme: {
background: '#0d1117',
foreground: '#e6edf3',
cursor: '#58a6ff',
cursorAccent: '#0d1117',
selectionBackground: 'rgba(88,166,255,0.25)',
black: '#484f58', red: '#ff7b72', green: '#3fb950', yellow: '#d29922',
blue: '#58a6ff', magenta: '#bc8cff', cyan: '#39c5cf', white: '#b1bac4',
brightBlack: '#6e7681', brightRed: '#ffa198', brightGreen: '#56d364',
brightYellow: '#e3b341', brightBlue: '#79c0ff', brightMagenta: '#d2a8ff',
brightCyan: '#56d4dd', brightWhite: '#f0f6fc',
},
scrollback: 2000,
allowProposedApi: true,
});
var fit = new FitAddon.FitAddon();
term.loadAddon(fit);
term.open(panel);
_terminal = term;
_termFit = fit;
// Initial fit after the panel is visible
setTimeout(function () {
if (_termFit) try { _termFit.fit(); } catch (_) {}
}, 30);
// Forward all keystrokes → SSH (onData registered once here)
term.onData(function (data) {
if (_termWs && _termWs.readyState === 1) {
_termWs.send(new TextEncoder().encode(data));
}
});
// Refit + notify server on resize
new ResizeObserver(function () {
if (!_termFit) return;
try { _termFit.fit(); } catch (_) {}
if (_termWs && _termWs.readyState === 1 && _terminal) {
_termWs.send(JSON.stringify({ type: 'resize', cols: _terminal.cols, rows: _terminal.rows }));
}
}).observe(panel);
_termConnect();
}
function _termConnect() {
if (_termWs && _termWs.readyState <= 1) return; // already open or connecting
var proto = location.protocol === 'https:' ? 'wss:' : 'ws:';
var ws = new WebSocket(proto + '//' + location.host + '/ws/terminal');
ws.binaryType = 'arraybuffer';
_termWs = ws;
ws.onopen = function () {
_termHideReconnect();
if (_terminal && ws.readyState === 1) {
ws.send(JSON.stringify({ type: 'resize', cols: _terminal.cols, rows: _terminal.rows }));
}
};
ws.onmessage = function (e) {
if (!_terminal) return;
_terminal.write(e.data instanceof ArrayBuffer ? new Uint8Array(e.data) : e.data);
};
ws.onclose = function () {
if (_terminal) _terminal.write('\r\n\x1b[33m\u2500\u2500 disconnected \u2500\u2500\x1b[0m\r\n');
_termShowReconnect();
};
ws.onerror = function () { /* onclose fires too */ };
}
function _termShowReconnect() {
var panel = document.getElementById('drawer-panel-terminal');
if (!panel || panel.querySelector('.term-reconnect-bar')) return;
var bar = document.createElement('div');
bar.className = 'term-reconnect-bar';
bar.innerHTML = '<span>Connection closed</span>'
+ '<button class="btn-secondary">\u21ba Reconnect</button>';
bar.querySelector('button').onclick = function () {
bar.remove();
_termConnect();
};
panel.appendChild(bar);
}
function _termHideReconnect() {
var bar = document.querySelector('.term-reconnect-bar');
if (bar) bar.remove();
}
}());


@ -81,6 +81,10 @@
{%- set short_busy = drive.smart_short and drive.smart_short.state == 'running' %}
{%- set long_busy = drive.smart_long and drive.smart_long.state == 'running' %}
{%- set selectable = not bi_active and not short_busy and not long_busy %}
{%- set bi_done = drive.burnin and drive.burnin.state in ('passed', 'failed', 'cancelled', 'unknown') %}
{%- set smart_done = (drive.smart_short and drive.smart_short.state in ('passed','failed','aborted'))
or (drive.smart_long and drive.smart_long.state in ('passed','failed','aborted')) %}
{%- set can_reset = (bi_done or smart_done) and not bi_active and not short_busy and not long_busy %}
<tr data-status="{{ drive.status }}" id="drive-{{ drive.id }}">
<td class="col-check">
{%- if selectable %}
@ -160,6 +164,12 @@
data-health="{{ drive.smart_health }}"
{% if short_busy or long_busy %}disabled{% endif %}
title="Start Burn-In">Burn-In</button>
<!-- Reset — clears SMART state so drive can be re-tested from scratch -->
{%- if can_reset %}
<button class="btn-action btn-reset"
data-drive-id="{{ drive.id }}"
title="Reset SMART state — clears test results so drive shows as fresh">Reset</button>
{%- endif %}
{%- endif %}
</div>
</td>


@ -6,7 +6,7 @@
{% include "components/modal_start.html" %}
{% include "components/modal_batch.html" %}
<!-- Stats bar — counts are updated live by app.js updateCounts() -->
<!-- Stats bar — drive counts updated live by app.js updateCounts(); sensor chips updated by SSE system-sensors event -->
<div class="stats-bar">
<div class="stat-card" data-stat-filter="all">
<span class="stat-value" id="stat-all">{{ drives | length }}</span>
@ -28,6 +28,33 @@
<span class="stat-value" id="stat-idle">0</span>
<span class="stat-label">Idle</span>
</div>
{%- set st = poller.system_temps if (poller and poller.system_temps) else {} %}
{%- if st.get('cpu_c') is not none or st.get('pch_c') is not none %}
<div class="stats-bar-sep"></div>
{%- if st.get('cpu_c') is not none %}
<div class="stat-sensor" id="sensor-cpu">
<span class="stat-sensor-val {{ st.get('cpu_c') | temp_class }}" id="sensor-cpu-val">{{ st.get('cpu_c') }}°</span>
<span class="stat-sensor-label">CPU</span>
</div>
{%- endif %}
{%- if st.get('pch_c') is not none %}
<div class="stat-sensor" id="sensor-pch">
<span class="stat-sensor-val {{ st.get('pch_c') | temp_class }}" id="sensor-pch-val">{{ st.get('pch_c') }}°</span>
<span class="stat-sensor-label">PCH</span>
</div>
{%- endif %}
{%- endif %}
{%- set tp = poller.thermal_pressure if poller else 'ok' %}
<div class="stat-sensor stat-sensor-thermal stat-sensor-thermal-{{ tp }}"
id="sensor-thermal"
{% if not tp or tp == 'ok' %}hidden{% endif %}>
<span class="stat-sensor-val" id="sensor-thermal-val">
{%- if tp == 'warn' %}WARM{%- elif tp == 'crit' %}HOT{%- else %}OK{%- endif %}
</span>
<span class="stat-sensor-label">Thermal</span>
</div>
</div>
<!-- Failed drive banner — shown/hidden by JS when failed count > 0 -->
@ -71,4 +98,33 @@
</div>
</div>
</div>
<!-- Log Drawer (fixed, lives outside SSE swap area) -->
<div id="log-drawer" class="log-drawer" hidden>
<div class="drawer-header">
<div class="drawer-drive-info">
<span class="drawer-devname" id="drawer-devname"></span>
<span class="drawer-drive-meta" id="drawer-drive-meta"></span>
</div>
<nav class="drawer-tabs">
<button class="drawer-tab active" data-tab="burnin">Burn-In</button>
<button class="drawer-tab" data-tab="smart">SMART</button>
<button class="drawer-tab" data-tab="events">Events</button>
<button class="drawer-tab" data-tab="terminal">Terminal</button>
</nav>
<div class="drawer-controls">
<label class="autoscroll-label">
<input type="checkbox" id="autoscroll-toggle" checked>
<span>Auto-scroll</span>
</label>
<button class="drawer-close" id="drawer-close-btn" title="Close (Esc)"></button>
</div>
</div>
<div class="drawer-body">
<div class="drawer-panel active" id="drawer-panel-burnin"></div>
<div class="drawer-panel" id="drawer-panel-smart"></div>
<div class="drawer-panel" id="drawer-panel-events"></div>
<div class="drawer-panel drawer-panel-terminal" id="drawer-panel-terminal"></div>
</div>
</div>
{% endblock %}


@ -32,6 +32,7 @@
<th>State</th>
<th>Operator</th>
<th>Started</th>
<th>Completed</th>
<th>Duration</th>
<th>Error</th>
<th class="col-actions"></th>
@ -54,6 +55,7 @@
</td>
<td class="text-muted">{{ j.operator or '—' }}</td>
<td class="mono text-muted">{{ j.started_at | format_dt_full }}</td>
<td class="mono text-muted">{{ j.finished_at | format_dt_full }}</td>
<td class="mono text-muted">{{ j.duration_seconds | format_duration }}</td>
<td class="error-cell">
{% if j.error_text %}
@ -67,7 +69,7 @@
{% endfor %}
{% else %}
<tr>
<td colspan="9" class="empty-state">No burn-in jobs found.</td>
<td colspan="10" class="empty-state">No burn-in jobs found.</td>
</tr>
{% endif %}
</tbody>


@ -17,6 +17,7 @@
<line x1="6" y1="18" x2="6.01" y2="18"></line>
</svg>
<span class="header-title">TrueNAS Burn-In</span>
<span class="header-version">v{{ app_version if app_version is defined else '—' }}</span>
</a>
<div class="header-meta">
<span class="live-indicator">


@ -6,12 +6,14 @@
<div class="page-toolbar">
<h1 class="page-title">Settings</h1>
<div class="toolbar-right">
<a class="btn-export" href="/docs" target="_blank" rel="noopener">API Docs</a>
<button type="button" id="check-updates-btn" class="btn-secondary">Check for Updates</button>
<span id="update-result" class="settings-test-result" style="display:none;margin-left:8px"></span>
<a class="btn-export" href="/docs" target="_blank" rel="noopener" style="margin-left:8px">API Docs</a>
</div>
</div>
<p class="page-subtitle">
Changes take effect immediately. Settings marked
<span class="badge-restart">restart required</span> must be changed in <code>.env</code>.
<span class="badge-restart">restart required</span> are saved but need a container restart to fully apply.
</p>
<form id="settings-form" autocomplete="off">
@ -89,6 +91,57 @@
</div>
</div>
<!-- SSH -->
<div class="settings-card">
<div class="settings-card-header">
<span class="settings-card-title">SSH (TrueNAS Direct)</span>
{% if ssh_configured %}
<span class="chip chip-passed" style="font-size:10px">Configured</span>
{% else %}
<span class="chip chip-unknown" style="font-size:10px">Not configured — using REST API / mock</span>
{% endif %}
</div>
<p class="sf-hint" style="margin-bottom:8px">
When configured, burn-in stages run smartctl and badblocks directly on TrueNAS over SSH,
enabling SMART attribute monitoring and real bad-block detection. Leave Host empty to use
the TrueNAS REST API (mock / dev mode).
</p>
<div class="sf-fields">
<div class="sf-full sf-row-test" style="margin-bottom:4px">
<button type="button" id="test-ssh-btn" class="btn-secondary">Test SSH Connection</button>
<span id="ssh-test-result" class="settings-test-result" style="display:none"></span>
</div>
<label for="ssh_host">Host / IP</label>
<input class="sf-input" id="ssh_host" name="ssh_host" type="text"
value="{{ editable.ssh_host }}" placeholder="10.0.0.x (same as TrueNAS IP)">
<label for="ssh_port">Port</label>
<input class="sf-input sf-input-xs" id="ssh_port" name="ssh_port"
type="number" min="1" max="65535" value="{{ editable.ssh_port }}" style="width:70px">
<label for="ssh_user">Username</label>
<input class="sf-input" id="ssh_user" name="ssh_user" type="text"
value="{{ editable.ssh_user }}" placeholder="root">
<label for="ssh_password">Password</label>
<input class="sf-input" id="ssh_password" name="ssh_password" type="password"
placeholder="leave blank to keep existing" autocomplete="new-password">
<label for="ssh_key">Private Key</label>
<div>
<textarea class="sf-input sf-textarea" id="ssh_key" name="ssh_key"
rows="6" placeholder="Paste PEM private key here (-----BEGIN ... KEY-----). Leave blank to keep existing." autocomplete="off"></textarea>
<span class="sf-hint" style="margin-top:3px">
Either password or key auth. Key takes precedence if both are set.
Key is stored securely in <code>/data/settings_overrides.json</code>.
</span>
</div>
</div>
</div>
</div><!-- /left col -->
<!-- RIGHT column: Notifications + Behavior -->
@ -157,9 +210,14 @@
<div class="sf-row">
<label class="sf-label" for="max_parallel_burnins">Max Parallel Burn-Ins</label>
<input class="sf-input sf-input-xs" id="max_parallel_burnins" name="max_parallel_burnins"
type="number" min="1" max="16" value="{{ editable.max_parallel_burnins }}">
type="number" min="1" max="60" value="{{ editable.max_parallel_burnins }}">
<span class="sf-hint">How many jobs can run at the same time</span>
</div>
<div id="parallel-warn" class="sf-inline-warn"
{% if editable.max_parallel_burnins <= 8 %}style="display:none"{% endif %}>
⚠ Running many simultaneous surface scans may saturate your storage controller
and produce unreliable results. Recommended: 2-4.
</div>
<div class="sf-row">
<label class="sf-label" for="stuck_job_hours">Stuck Job Threshold (hours)</label>
@ -167,52 +225,101 @@
type="number" min="1" max="168" value="{{ editable.stuck_job_hours }}">
<span class="sf-hint">Jobs running longer than this → auto-marked unknown</span>
</div>
<div class="sf-divider"></div>
<div class="sf-row">
<label class="sf-label" for="temp_warn_c">Temp Warning (°C)</label>
<input class="sf-input sf-input-xs" id="temp_warn_c" name="temp_warn_c"
type="number" min="20" max="80" value="{{ editable.temp_warn_c }}">
<span class="sf-hint">Show orange above this temperature</span>
</div>
<div class="sf-row">
<label class="sf-label" for="temp_crit_c">Temp Critical (°C)</label>
<input class="sf-input sf-input-xs" id="temp_crit_c" name="temp_crit_c"
type="number" min="20" max="80" value="{{ editable.temp_crit_c }}">
<span class="sf-hint">Show red + block burn-in start above this temperature</span>
</div>
<div class="sf-row">
<label class="sf-label" for="bad_block_threshold">Bad Block Threshold</label>
<input class="sf-input sf-input-xs" id="bad_block_threshold" name="bad_block_threshold"
type="number" min="0" max="9999" value="{{ editable.bad_block_threshold }}">
<span class="sf-hint">Max bad blocks before surface validate fails (Stage 7)</span>
</div>
</div>
</div><!-- /right col -->
</div><!-- /two-col -->
<!-- System settings (restart required) -->
<div class="settings-card" style="margin-top:16px">
<div class="settings-card-header">
<span class="settings-card-title">System</span>
<span class="badge-restart">restart required to apply</span>
</div>
<div class="settings-two-col" style="gap:16px">
<div class="sf-fields">
<label for="truenas_base_url">TrueNAS URL</label>
<input class="sf-input" id="truenas_base_url" name="truenas_base_url" type="text"
value="{{ editable.truenas_base_url }}" placeholder="http://10.0.0.x">
<label for="truenas_api_key">API Key</label>
<input class="sf-input" id="truenas_api_key" name="truenas_api_key" type="password"
placeholder="leave blank to keep existing" autocomplete="new-password">
<label for="truenas_verify_tls">Verify TLS</label>
<label class="toggle" style="margin-top:2px">
<input type="checkbox" id="truenas_verify_tls" name="truenas_verify_tls"
{% if editable.truenas_verify_tls %}checked{% endif %}>
<span class="toggle-slider"></span>
</label>
</div>
<div class="sf-fields">
<label for="poll_interval_seconds">Poll Interval (s)</label>
<input class="sf-input sf-input-xs" id="poll_interval_seconds" name="poll_interval_seconds"
type="number" min="1" max="300" value="{{ editable.poll_interval_seconds }}">
<label for="stale_threshold_seconds">Stale Threshold (s)</label>
<input class="sf-input sf-input-xs" id="stale_threshold_seconds" name="stale_threshold_seconds"
type="number" min="1" max="600" value="{{ editable.stale_threshold_seconds }}">
<label for="log_level">Log Level</label>
<select class="sf-select" id="log_level" name="log_level">
{% for lvl in ['DEBUG','INFO','WARNING','ERROR','CRITICAL'] %}
<option value="{{ lvl }}" {% if editable.log_level == lvl %}selected{% endif %}>{{ lvl }}</option>
{% endfor %}
</select>
<label for="allowed_ips">IP Allowlist</label>
<div>
<input class="sf-input" id="allowed_ips" name="allowed_ips" type="text"
value="{{ editable.allowed_ips }}" placeholder="10.0.0.0/24,127.0.0.1 (empty = allow all)">
<span class="sf-hint" style="margin-top:3px">Comma-separated IPs/CIDRs. Empty = allow all.</span>
</div>
</div>
</div>
</div>
<!-- Save row -->
<div class="settings-save-bar">
<button type="submit" class="btn-primary" id="save-btn">Save Settings</button>
<button type="button" class="btn-secondary" id="cancel-settings-btn">Cancel</button>
<span id="save-result" class="settings-test-result" style="display:none"></span>
</div>
</form>
<!-- System (read-only) -->
<div class="settings-card settings-card-readonly">
<div class="settings-card-header">
<span class="settings-card-title">System</span>
<span class="badge-restart">restart required to change</span>
</div>
<div class="sf-readonly-grid">
<div class="sf-ro-row">
<span class="sf-ro-label">TrueNAS URL</span>
<span class="sf-ro-value mono">{{ readonly.truenas_base_url }}</span>
</div>
<div class="sf-ro-row">
<span class="sf-ro-label">Verify TLS</span>
<span class="sf-ro-value">{{ 'Yes' if readonly.truenas_verify_tls else 'No' }}</span>
</div>
<div class="sf-ro-row">
<span class="sf-ro-label">Poll Interval</span>
<span class="sf-ro-value mono">{{ readonly.poll_interval_seconds }}s</span>
</div>
<div class="sf-ro-row">
<span class="sf-ro-label">Stale Threshold</span>
<span class="sf-ro-value mono">{{ readonly.stale_threshold_seconds }}s</span>
</div>
<div class="sf-ro-row">
<span class="sf-ro-label">IP Allowlist</span>
<span class="sf-ro-value mono">{{ readonly.allowed_ips }}</span>
</div>
<div class="sf-ro-row">
<span class="sf-ro-label">Log Level</span>
<span class="sf-ro-value mono">{{ readonly.log_level }}</span>
</div>
</div>
<!-- Restart required banner — shown after saving system settings -->
<div id="restart-banner" style="display:none;margin-top:12px;padding:12px 16px;background:rgba(255,170,0,0.12);border:1px solid var(--yellow);border-radius:8px;color:var(--text-strong)">
<strong>&#9888; Container restart required</strong> — system settings are saved but won't take effect until you restart the app container:
<pre style="margin:8px 0 0;padding:8px 10px;background:var(--bg-card);border-radius:5px;font-size:12px;color:var(--text-strong);user-select:all">docker compose restart app</pre>
<span style="font-size:11px;color:var(--text-muted)">Run this on <strong>maple.local</strong> from <code>~/docker/stacks/truenas-burnin/</code></span>
</div>
</form>
<script>
(function () {
@ -260,7 +367,14 @@
});
var data = await resp.json();
if (resp.ok) {
// Show restart notice if any system settings were saved
var systemFields = ['truenas_base_url','truenas_api_key','truenas_verify_tls',
'poll_interval_seconds','stale_threshold_seconds','allowed_ips','log_level'];
var savedKeys = data.keys || [];
var needsRestart = savedKeys.some(function(k) { return systemFields.indexOf(k) >= 0; });
showResult(saveResult, true, 'Saved');
var restartBanner = document.getElementById('restart-banner');
if (restartBanner) restartBanner.style.display = needsRestart ? '' : 'none';
} else {
showResult(saveResult, false, data.detail || 'Save failed');
}
@ -298,6 +412,62 @@
testBtn.textContent = 'Test Connection';
}
});
// Parallel burn-in warning
var parallelInput = document.getElementById('max_parallel_burnins');
var parallelWarn = document.getElementById('parallel-warn');
if (parallelInput && parallelWarn) {
parallelInput.addEventListener('input', function () {
parallelWarn.style.display = parseInt(parallelInput.value, 10) > 8 ? '' : 'none';
});
}
// Test SSH
var sshBtn = document.getElementById('test-ssh-btn');
var sshResult = document.getElementById('ssh-test-result');
if (sshBtn) {
sshBtn.addEventListener('click', async function () {
sshBtn.disabled = true;
sshBtn.textContent = 'Testing…';
sshResult.style.display = 'none';
try {
var resp = await fetch('/api/v1/settings/test-ssh', { method: 'POST' });
var data = await resp.json();
showResult(sshResult, resp.ok, resp.ok ? 'Connection OK' : (data.detail || 'Failed'));
} catch (e) {
showResult(sshResult, false, 'Network error');
} finally {
sshBtn.disabled = false;
sshBtn.textContent = 'Test SSH Connection';
}
});
}
// Check for Updates
var updBtn = document.getElementById('check-updates-btn');
var updResult = document.getElementById('update-result');
updBtn.addEventListener('click', async function () {
updBtn.disabled = true;
updBtn.textContent = 'Checking…';
updResult.style.display = 'none';
try {
var resp = await fetch('/api/v1/updates/check');
var data = await resp.json();
if (data.update_available) {
showResult(updResult, false, 'Update available: v' + data.latest + ' (current: v' + data.current + ')');
} else if (data.latest) {
showResult(updResult, true, 'Up to date (v' + data.current + ')');
} else {
var msg = data.message || ('v' + data.current + ' — no releases found');
showResult(updResult, true, msg);
}
} catch (e) {
showResult(updResult, false, 'Network error');
} finally {
updBtn.disabled = false;
updBtn.textContent = 'Check for Updates';
}
});
}());
</script>
{% endblock %}


@ -119,5 +119,65 @@
{% endif %}
</div>
</div>
<div class="stats-grid" style="margin-top:24px">
<!-- Average duration by drive size -->
<div class="stats-section">
<h2 class="section-title">Avg. Test Duration by Drive Size</h2>
{% if by_size %}
<div class="table-wrap" style="max-height:none">
<table>
<thead>
<tr>
<th>Size</th>
<th style="text-align:right">Jobs</th>
<th style="text-align:right">Avg Duration</th>
</tr>
</thead>
<tbody>
{% for s in by_size %}
<tr>
<td style="font-weight:500;color:var(--text-strong)">{{ s.size_tb }} TB</td>
<td class="mono text-muted" style="text-align:right">{{ s.total }}</td>
<td class="mono" style="text-align:right;color:var(--text-strong)">{{ s.avg_hours }}h</td>
</tr>
{% endfor %}
</tbody>
</table>
</div>
{% else %}
<div class="empty-state" style="border:1px solid var(--border);border-radius:8px;padding:32px">No completed jobs yet.</div>
{% endif %}
</div>
<!-- Failure breakdown by stage -->
<div class="stats-section">
<h2 class="section-title">Failures by Stage</h2>
{% if by_failure_stage %}
<div class="table-wrap" style="max-height:none">
<table>
<thead>
<tr>
<th>Stage</th>
<th style="text-align:right">Count</th>
</tr>
</thead>
<tbody>
{% for f in by_failure_stage %}
<tr>
<td style="font-weight:500;color:var(--red)">{{ f.failed_stage | replace('_',' ') | title }}</td>
<td class="mono" style="text-align:right;color:var(--red)">{{ f.count }}</td>
</tr>
{% endfor %}
</tbody>
</table>
</div>
{% else %}
<div class="empty-state" style="border:1px solid var(--border);border-radius:8px;padding:32px">No failures recorded.</div>
{% endif %}
</div>
</div>
{% endblock %}


@ -0,0 +1,150 @@
"""
WebSocket asyncssh PTY bridge for the live terminal drawer tab.
Protocol
--------
Client → server: binary = raw terminal input bytes
                 text = JSON control message, e.g. {"type":"resize","cols":80,"rows":24}
Server → client: binary = raw terminal output bytes
"""
import asyncio
import json
import logging
import asyncssh
from fastapi import WebSocket, WebSocketDisconnect
log = logging.getLogger(__name__)
async def handle(ws: WebSocket) -> None:
"""Accept a WebSocket connection and bridge it to an SSH PTY."""
await ws.accept()
from app.config import settings # late import — avoids circular at module level
# ── Guard: SSH must be configured ──────────────────────────────────────
if not settings.ssh_host:
await _send(ws,
b"\r\n\x1b[33mSSH not configured.\x1b[0m "
b"Set SSH Host in \x1b[1mSettings \u2192 SSH\x1b[0m first.\r\n"
)
await ws.close(1008)
return
connect_kw: dict = dict(
host=settings.ssh_host,
port=settings.ssh_port,
username=settings.ssh_user,
known_hosts=None,
)
if settings.ssh_key.strip():
try:
connect_kw["client_keys"] = [asyncssh.import_private_key(settings.ssh_key)]
except Exception as exc:
await _send(ws, f"\r\n\x1b[31mBad SSH key: {exc}\x1b[0m\r\n".encode())
await ws.close(1011)
return
elif settings.ssh_password:
connect_kw["password"] = settings.ssh_password
else:
# Fall back to mounted key file (same logic as ssh_client._connect)
import os
from app import ssh_client as _sc
key_path = os.environ.get("SSH_KEY_FILE", _sc._MOUNTED_KEY_PATH)
if os.path.exists(key_path):
connect_kw["client_keys"] = [key_path]
else:
await _send(ws,
b"\r\n\x1b[33mNo SSH credentials configured.\x1b[0m "
b"Set a password or private key in Settings.\r\n"
)
await ws.close(1008)
return
await _send(ws,
f"\r\n\x1b[36mConnecting to {settings.ssh_host}\u2026\x1b[0m\r\n".encode()
)
# ── Open SSH connection ─────────────────────────────────────────────────
try:
async with asyncssh.connect(**connect_kw) as conn:
process = await conn.create_process(
term_type="xterm-256color",
term_size=(80, 24),
encoding=None, # raw bytes — xterm.js handles encoding
)
await _send(ws, b"\r\n\x1b[32mConnected\x1b[0m\r\n\r\n")
stop = asyncio.Event()
async def ssh_to_ws() -> None:
try:
async for chunk in process.stdout:
await ws.send_bytes(chunk)
except Exception:
pass
finally:
stop.set()
async def ws_to_ssh() -> None:
try:
while not stop.is_set():
msg = await ws.receive()
if msg["type"] == "websocket.disconnect":
break
if msg.get("bytes"):
process.stdin.write(msg["bytes"])
elif msg.get("text"):
try:
ctrl = json.loads(msg["text"])
if ctrl.get("type") == "resize":
process.change_terminal_size(
int(ctrl["cols"]), int(ctrl["rows"])
)
except Exception:
pass
except WebSocketDisconnect:
pass
except Exception:
pass
finally:
stop.set()
t1 = asyncio.create_task(ssh_to_ws())
t2 = asyncio.create_task(ws_to_ssh())
_done, pending = await asyncio.wait(
[t1, t2], return_when=asyncio.FIRST_COMPLETED
)
for t in pending:
t.cancel()
try:
await t
except asyncio.CancelledError:
pass
except asyncssh.PermissionDenied:
await _send(ws, b"\r\n\x1b[31mSSH permission denied.\x1b[0m\r\n")
except asyncssh.DisconnectError as exc:
await _send(ws, f"\r\n\x1b[31mSSH disconnected: {exc}\x1b[0m\r\n".encode())
except OSError as exc:
await _send(ws, f"\r\n\x1b[31mCannot reach {settings.ssh_host}: {exc}\x1b[0m\r\n".encode())
except Exception as exc:
log.exception("Terminal WebSocket error")
await _send(ws, f"\r\n\x1b[31mError: {exc}\x1b[0m\r\n".encode())
finally:
try:
await ws.close()
except Exception:
pass
async def _send(ws: WebSocket, data: bytes) -> None:
"""Best-effort send — silently swallow errors if the socket is already gone."""
try:
await ws.send_bytes(data)
except Exception:
pass
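The text-frame control message described in the protocol docstring can be built with a tiny helper. This is a sketch — the function name is illustrative; only the JSON shape comes from the protocol above:

```python
import json

def resize_message(cols: int, rows: int) -> str:
    """Encode the resize control message that ws_to_ssh() parses on the server."""
    return json.dumps({"type": "resize", "cols": int(cols), "rows": int(rows)})
```

A client sends this as a text frame; raw keystrokes travel as binary frames.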


@ -65,7 +65,13 @@ class TrueNASClient:
"get_disks",
)
r.raise_for_status()
return r.json()
disks = r.json()
# Filter out expired records — TrueNAS keeps historical entries for removed
# disks with expiretime set. Only return currently-present drives.
active = [d for d in disks if not d.get("expiretime")]
if len(active) < len(disks):
log.debug("get_disks: filtered %d expired record(s)", len(disks) - len(active))
return active
async def get_smart_jobs(self, state: str | None = None) -> list[dict]:
params: dict = {"method": "smart.test"}
@ -110,3 +116,49 @@ class TrueNASClient:
)
r.raise_for_status()
return r.json()
async def get_disk_temperatures(self) -> dict[str, float | None]:
"""
Returns {devname: celsius | None}.
Uses POST /api/v2.0/disk/temperatures available on TrueNAS SCALE 25.10+.
CORE compatibility: raises on 404/405, caller should catch and skip.
"""
r = await _with_retry(
lambda: self._client.post("/api/v2.0/disk/temperatures", json={}),
"get_disk_temperatures",
)
r.raise_for_status()
return r.json()
async def wipe_disk(self, devname: str, mode: str = "FULL") -> int:
"""
Start a disk wipe job. Not retried: duplicate starts would launch a second wipe.
mode: "QUICK" (wipe MBR/partitions only), "FULL" (write zeros), "FULL_RANDOM" (write random)
devname: basename only, e.g. "ada0" (not "/dev/ada0")
Returns the TrueNAS job ID.
"""
r = await self._client.post(
"/api/v2.0/disk/wipe",
json={"dev": devname, "mode": mode},
)
r.raise_for_status()
return r.json()
async def get_job(self, job_id: int) -> dict | None:
"""
Fetch a single TrueNAS job by ID.
Returns the job dict, or None if not found.
"""
import json as _json
r = await _with_retry(
lambda: self._client.get(
"/api/v2.0/core/get_jobs",
params={"filters": _json.dumps([["id", "=", job_id]])},
),
f"get_job({job_id})",
)
r.raise_for_status()
jobs = r.json()
if isinstance(jobs, list) and jobs:
return jobs[0]
return None
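Since `wipe_disk` is deliberately not retried and only returns a job ID, a caller has to drive completion itself by polling `get_job`. A minimal sketch — the helper name, polling cadence, and terminal-state set are assumptions, not confirmed against the TrueNAS API:

```python
import asyncio

async def wipe_and_wait(client, devname: str, interval: float = 5.0) -> dict:
    """Start a QUICK wipe, then poll until the job reaches a final state.

    `client` is assumed to be a connected TrueNASClient instance; the
    terminal states checked here (SUCCESS/FAILED/ABORTED) are assumptions.
    """
    job_id = await client.wipe_disk(devname, mode="QUICK")
    while True:
        job = await client.get_job(job_id)
        if job and job.get("state") in ("SUCCESS", "FAILED", "ABORTED"):
            return job
        await asyncio.sleep(interval)
```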


@ -1,10 +1,13 @@
services:
mock-truenas:
build: ./mock-truenas
container_name: mock-truenas
ports:
- "8000:8000"
restart: unless-stopped
# mock-truenas is kept for local dev — not started in production
# To use mock mode: docker compose --profile mock up
# mock-truenas:
# build: ./mock-truenas
# container_name: mock-truenas
# ports:
# - "8000:8000"
# profiles: [mock]
# restart: unless-stopped
app:
build: .
@ -16,6 +19,5 @@ services:
- ./data:/data
- ./app/templates:/opt/app/app/templates
- ./app/static:/opt/app/app/static
depends_on:
- mock-truenas
- /home/brandon/.ssh/id_ed25519:/run/secrets/ssh_key:ro
restart: unless-stopped


@ -1,7 +1,8 @@
fastapi
uvicorn
uvicorn[standard]
aiosqlite
httpx
pydantic-settings
jinja2
sse-starlette
asyncssh