# TrueNAS Burn-In Dashboard — Project Context
> Drop this file in any new Claude session to resume work with full context.
> Last updated: 2026-02-24 (Stage 8)
---
## What This Is
A self-hosted web dashboard for running and tracking hard-drive burn-in tests
against a TrueNAS CORE instance. Deployed on **maple.local** (10.0.0.138).
- **App URL**: http://10.0.0.138:8084 (or http://burnin.hellocomputer.xyz)
- **Stack path on maple.local**: `~/docker/stacks/truenas-burnin/`
- **Source (local mac)**: `~/Desktop/claude-sandbox/truenas-burnin/`
- **Compose synced to maple.local** via `scp` or manual copy
### Stages completed
| Stage | Description | Status |
|-------|-------------|--------|
| 1 | Mock TrueNAS CORE v2.0 API (15 drives, `sda`–`sdo`) | ✅ |
| 2 | Backend core (FastAPI, SQLite/WAL, poller, TrueNAS client) | ✅ |
| 3 | Dashboard UI (Jinja2, SSE live updates, dark theme) | ✅ |
| 4 | Burn-in orchestrator (queue, concurrency, start/cancel) | ✅ |
| 5 | History page, job detail page, CSV export | ✅ |
| 6 | Hardening (retries, JSON logging, IP allowlist, poller watchdog) | ✅ |
| 6b | UX overhaul (stats bar, alerts, batch, notifications, location, print, analytics) | ✅ |
| 6c | Settings overhaul (editable form, runtime store, SMTP fix, stage selection) | ✅ |
| 6d | Cancel SMART tests, Cancel All burn-ins, drag-to-reorder stages in modals | ✅ |
| 7 | SSH burn-in execution, SMART attr monitoring, drive reset, version badge, stats polish | ✅ |
| 8 | Live SSH terminal in drawer (xterm.js + asyncssh WebSocket PTY bridge) | ✅ |
---
## File Map
```
truenas-burnin/
├── docker-compose.yml      # two services: mock-truenas + app
├── Dockerfile              # app container
├── requirements.txt
├── .env.example
├── data/                   # SQLite DB lives here (gitignored, created on deploy)
├── mock-truenas/
│   ├── Dockerfile
│   └── app.py              # FastAPI mock of TrueNAS CORE v2.0 REST API
└── app/
    ├── __init__.py
    ├── config.py           # pydantic-settings; reads .env
    ├── database.py         # schema, migrations, init_db(), get_db()
    ├── models.py           # Pydantic v2 models; StartBurninRequest has run_surface/run_short/run_long + profile property
    ├── settings_store.py   # runtime settings store — persists to /data/settings_overrides.json
    ├── ssh_client.py       # asyncssh client: smartctl parsing, badblocks streaming, test_connection
    ├── terminal.py         # WebSocket ↔ asyncssh PTY bridge for live terminal tab
    ├── truenas.py          # httpx async client with retry (lambda factory pattern)
    ├── poller.py           # poll loop, SSE pub/sub, stale detection, stuck-job check
    ├── burnin.py           # orchestrator, semaphore, stages, check_stuck_jobs()
    ├── notifier.py         # webhook + immediate email alerts on job completion
    ├── mailer.py           # daily HTML email + per-job alert email
    ├── logging_config.py   # structured JSON logging
    ├── renderer.py         # Jinja2 + filters (format_bytes, format_eta, format_elapsed, …)
    ├── routes.py           # all FastAPI route handlers
    ├── main.py             # app factory, IP allowlist middleware, lifespan
    ├── static/
    │   ├── app.css         # full dark theme + mobile responsive
    │   └── app.js          # push notifications, batch, elapsed timers, inline edit
    └── templates/
        ├── layout.html     # header nav: History, Stats, Audit, Settings, bell button
        ├── dashboard.html  # stats bar, failed banner, batch bar, log drawer (4 tabs: Burn-In/SMART/Events/Terminal)
        ├── history.html
        ├── job_detail.html # + Print/Export button
        ├── audit.html      # audit event log
        ├── stats.html      # analytics: pass rate by model, daily activity, duration by size, failures by stage
        ├── settings.html   # editable 2-col form: SMTP + SSH (left) + Notifications/Behavior/Webhook/System (right)
        ├── job_print.html  # print view with client-side QR code (qrcodejs CDN)
        └── components/
            ├── drives_table.html  # checkboxes, elapsed time, location inline edit
            ├── modal_start.html   # single-drive burn-in modal
            └── modal_batch.html   # batch burn-in modal
```
---
## Architecture Overview
```
Browser ──HTMX SSE──▶ GET /sse/drives
                          │ poller.subscribe()
                          ▼
                      asyncio.Queue ◀── poller.run() notifies after each poll
                          │             & after each burnin stage update
                          ▼
                      render drives_table.html
                      yield SSE "drives-update" event
```
- **Poller** (`poller.py`): runs every `POLL_INTERVAL_SECONDS` (default 12s), calls
TrueNAS `/api/v2.0/disk` and `/api/v2.0/core/get_jobs`, writes to SQLite,
notifies SSE subscribers
- **Burn-in** (`burnin.py`): `asyncio.Semaphore(max_parallel_burnins)` gates
concurrency. Jobs are created immediately (queued state), semaphore gates
actual execution. On startup, any interrupted running jobs → state=unknown;
queued jobs are re-enqueued.
- **SSE** (`routes.py /sse/drives`): one persistent connection per browser tab.
Renders fresh `drives_table.html` HTML fragment on every notification.
- **HTMX** (`dashboard.html`): `hx-ext="sse"` + `sse-swap="drives-update"`
replaces `#drives-tbody` content without page reload.
---
## Database Schema (SQLite WAL mode)
```sql
-- drives: upsert by truenas_disk_id (the TrueNAS internal disk identifier)
drives (id, truenas_disk_id UNIQUE, devname, serial, model, size_bytes,
        temperature_c, smart_health, last_polled_at)
-- smart_tests: one row per drive+test_type combination (UNIQUE constraint)
smart_tests (id, drive_id FK, test_type CHECK('short','long'),
             state, percent, started_at, eta_at, finished_at, error_text,
             UNIQUE(drive_id, test_type))
-- burnin_jobs: one row per burn-in run (multiple per drive over time)
burnin_jobs (id, drive_id FK, profile, state CHECK(queued/running/passed/
             failed/cancelled/unknown), percent, stage_name, operator,
             created_at, started_at, finished_at, error_text)
-- burnin_stages: one row per stage per job
burnin_stages (id, burnin_job_id FK, stage_name, state, percent,
               started_at, finished_at, error_text,
               log_text TEXT,      -- raw smartctl/badblocks SSH output
               bad_blocks INTEGER) -- bad sector count from surface_validate
-- audit_events: append-only log
audit_events (id, event_type, drive_id, job_id, operator, note, created_at)
-- drives columns added by migrations:
-- location TEXT, notes TEXT (Stage 6b)
-- smart_attrs TEXT -- JSON blob of last SMART attribute snapshot (Stage 7)
-- smart_tests columns added by migrations:
-- raw_output TEXT -- raw smartctl -a output (Stage 7)
```
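The migration-added columns above imply an idempotent add-column pattern. A minimal sketch (the helper name is illustrative; the real `database.py` may differ):

```python
import sqlite3

def add_column_if_missing(db: sqlite3.Connection, table: str,
                          column: str, decl: str) -> None:
    # PRAGMA table_info row[1] is the column name; skip if already migrated.
    cols = {row[1] for row in db.execute(f"PRAGMA table_info({table})")}
    if column not in cols:
        db.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE drives (id INTEGER PRIMARY KEY, devname TEXT)")
add_column_if_missing(db, "drives", "location", "TEXT")     # Stage 6b
add_column_if_missing(db, "drives", "smart_attrs", "TEXT")  # Stage 7
add_column_if_missing(db, "drives", "location", "TEXT")     # re-run: no-op
```

Running the same migration twice is safe, which is what lets `init_db()` apply all migrations unconditionally on every startup.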
---
## Burn-In Stage Definitions
```python
STAGE_ORDER = {
    "quick": ["precheck", "short_smart", "io_validate", "final_check"],
    "full": ["precheck", "surface_validate", "short_smart", "long_smart", "final_check"],
}
```
The UI originally exposed only the **full** profile (destructive); since Stage 6c the modals expose per-stage checkboxes instead (see Feature Reference, Stage 6c). The **quick** profile exists for dev/testing only.
---
## TrueNAS API Contracts Used
| Method | Endpoint | Notes |
|--------|----------|-------|
| GET | `/api/v2.0/disk` | List all disks |
| POST | `/api/v2.0/smart/test` | Start SMART test `{disks:[name], type:"SHORT"\|"LONG"}` |
| GET | `/api/v2.0/core/get_jobs` | Filter `[["method","=","smart.test"]]` |
| POST | `/api/v2.0/core/job_abort` | `job_id` positional arg |
| GET | `/api/v2.0/smart/test/results/{disk}` | Per-disk SMART results |
Auth: `Authorization: Bearer {TRUENAS_API_KEY}` header.
---
## Config / Environment Variables
All read from `.env` via `pydantic-settings`. See `.env.example` for full list.
| Variable | Default | Notes |
|----------|---------|-------|
| `APP_HOST` | `0.0.0.0` | |
| `APP_PORT` | `8080` | |
| `DB_PATH` | `/data/app.db` | Inside container |
| `TRUENAS_BASE_URL` | `http://localhost:8000` | Point at mock or real TrueNAS |
| `TRUENAS_API_KEY` | `mock-key` | Real API key for prod |
| `TRUENAS_VERIFY_TLS` | `false` | Set true for prod with valid cert |
| `POLL_INTERVAL_SECONDS` | `12` | |
| `STALE_THRESHOLD_SECONDS` | `45` | UI shows warning if data older than this |
| `MAX_PARALLEL_BURNINS` | `2` | asyncio.Semaphore limit |
| `SURFACE_VALIDATE_SECONDS` | `45` | Mock only — duration of surface stage |
| `IO_VALIDATE_SECONDS` | `25` | Mock only — duration of I/O stage |
| `STUCK_JOB_HOURS` | `24` | Hours before a running job is auto-marked unknown |
| `LOG_LEVEL` | `INFO` | |
| `ALLOWED_IPS` | `` | Empty = allow all. Comma-sep IPs/CIDRs |
| `SMTP_HOST` | `` | Empty = email disabled |
| `SMTP_PORT` | `587` | |
| `SMTP_USER` | `` | |
| `SMTP_PASSWORD` | `` | |
| `SMTP_FROM` | `` | |
| `SMTP_TO` | `` | Comma-separated |
| `SMTP_REPORT_HOUR` | `8` | Local hour (0-23) to send daily report |
| `SMTP_ALERT_ON_FAIL` | `true` | Immediate email when a job fails |
| `SMTP_ALERT_ON_PASS` | `false` | Immediate email when a job passes |
| `WEBHOOK_URL` | `` | POST JSON on burnin_passed/burnin_failed. Works with ntfy, Slack, Discord, n8n |
| `TEMP_WARN_C` | `46` | Temperature warning threshold (°C) |
| `TEMP_CRIT_C` | `55` | Temperature critical threshold — precheck fails above this |
| `BAD_BLOCK_THRESHOLD` | `0` | Max bad blocks allowed before surface_validate fails (0 = any bad = fail) |
| `APP_VERSION` | `1.0.0-8` | Displayed in header version badge |
| `SSH_HOST` | `` | TrueNAS SSH hostname/IP — empty disables SSH mode (uses mock/REST) |
| `SSH_PORT` | `22` | TrueNAS SSH port |
| `SSH_USER` | `root` | TrueNAS SSH username |
| `SSH_PASSWORD` | `` | TrueNAS SSH password (use key instead for production) |
| `SSH_KEY` | `` | TrueNAS SSH private key PEM string — loaded in-memory, never written to disk |
---
## Deploy Workflow
### First deploy (already done)
```bash
# On maple.local
cd ~/docker/stacks/truenas-burnin
docker compose up -d --build
```
### Redeploy after code changes
```bash
# Copy changed files from mac to maple.local first, e.g.:
scp -P 2225 -r app/ brandon@10.0.0.138:~/docker/stacks/truenas-burnin/
# Then on maple.local:
ssh brandon@10.0.0.138 -p 2225
cd ~/docker/stacks/truenas-burnin
docker compose up -d --build
```
### Reset the database (e.g. after schema changes)
```bash
# On maple.local — stop containers first
docker compose stop app
# Delete DB using alpine (container owns the file, sudo not available)
docker run --rm -v ~/docker/stacks/truenas-burnin/data:/data alpine rm -f /data/app.db
docker compose start app
```
### Check logs
```bash
docker compose logs -f app
docker compose logs -f mock-truenas
```
---
## Mock TrueNAS Server (`mock-truenas/app.py`)
- 15 drives: `sda`–`sdo`
- Drive mix: 3× ST12000NM0008 12TB, 3× WD80EFAX 8TB, 2× ST16000NM001G 16TB,
2× ST4000VN008 4TB, 2× TOSHIBA MG06ACA10TE 10TB, 1× HGST HUS728T8TAL5200 8TB,
1× Seagate Barracuda ST6000DM003 6TB, 1× **FAIL001** (sdn) — always fails at ~30%
- SHORT test: 90s simulated; LONG test: 480s simulated; tick every 5s
- Debug endpoints:
- `POST /debug/reset` — reset all jobs/state
- `GET /debug/state` — dump current state
- `POST /debug/complete-all-jobs` — instantly complete all running tests
---
## Key Implementation Patterns
### Retry pattern — lambda factory (NOT coroutine object)
```python
# CORRECT: pass a factory so each retry creates a fresh coroutine
r = await _with_retry(lambda: self._client.get("/api/v2.0/disk"), "get_disks")
# WRONG: coroutine is exhausted after first await, retry silently fails
r = await _with_retry(self._client.get("/api/v2.0/disk"), "get_disks")
```
### SSE template rendering
```python
# Use templates.env.get_template().render() — not TemplateResponse (that's a Response object)
html = templates.env.get_template("components/drives_table.html").render(drives=drives)
yield {"event": "drives-update", "data": html}
```
### Sticky thead scroll fix
```css
/* BOTH axes required on table-wrap for position:sticky to work on thead */
.table-wrap {
  overflow: auto;                   /* NOT overflow-x: auto */
  max-height: calc(100vh - 130px);
}
thead { position: sticky; top: 0; z-index: 10; }
```
### export.csv route ordering
```python
# MUST register export.csv BEFORE /{job_id} — FastAPI tries int() on "export.csv"
@router.get("/api/v1/burnin/export.csv") # first
async def burnin_export_csv(...): ...
@router.get("/api/v1/burnin/{job_id}") # second
async def burnin_get(job_id: int, ...): ...
```
---
## Known Issues / Past Bugs Fixed
| Bug | Root Cause | Fix |
|-----|-----------|-----|
| `_execute_stages` used `STAGE_ORDER[profile]` ignoring custom order | Stage order stored in DB but not read back | `_run_job` reads stages from `burnin_stages ORDER BY id`; `_execute_stages` accepts `stages: list[str]` |
| Poller stuck at 'running' after completion | `_sync_history()` had early-return guard when state=running | Removed guard — `_sync_history` only called when job not in active dict |
| DB schema tables missing after edit | Tables split into separate variable never passed to `executescript()` | Put all tables in single `SCHEMA` string |
| Retry not retrying | `_with_retry(coro)` — coroutine exhausted after first fail | Changed to `_with_retry(factory: Callable[[], Coroutine])` |
| `error_text` overwritten | `_finish_stage(success=False)` overwrote error set by stage handler | `_finish_stage` omits `error_text` column in SQL when param is None |
| Cancelled stage showed 'failed' | `_execute_stages` called `_finish_stage(success=False)` on cancel | Check `_is_cancelled()`, call `_cancel_stage()` instead |
| export.csv returns 422 | Route registered after `/{job_id}`, FastAPI tries `int("export.csv")` | Move export route before parameterized route |
| Old drive names persist after mock rename | Poller upserts by `truenas_disk_id`, old rows stay | Delete `app.db` and restart |
| First row clipped behind sticky thead | `overflow-x: auto` only creates partial stacking context | Use `overflow: auto` (both axes) on `.table-wrap` |
| `rm data/app.db` permission denied | Container owns the file | Use `docker run --rm -v .../data:/data alpine rm -f /data/app.db` |
| First row clipped after Stage 6b | Stats bar added 70px but max-height not updated | `max-height: calc(100vh - 205px)` |
| SMTP "Connection unexpectedly closed" | `_send_email` used `settings.smtp_port` (587 default) even in SSL mode | Derive port from mode via `_MODE_PORTS` dict; SSL→465, STARTTLS→587, Plain→25 |
| SSL mode missing EHLO | `smtplib.SMTP_SSL` was created without calling `ehlo()` | Added `server.ehlo()` after both SSL and STARTTLS connections |
| `profile` NameError in `_execute_stages` | `_execute_stages` called `_recalculate_progress(job_id, profile)` but `profile` not in scope | Changed to `_recalculate_progress(job_id)` — profile param was unused |
| `app_version` Jinja2 global rendered as function | Set `templates.env.globals["app_version"] = _get_app_version` (callable) | Set to the static string value directly: `= _settings.app_version` |
| All buttons broken (Short/Long/Burn-In/Cancel) | `stages.forEach(function(s){` in `_drawerRenderBurnin` missing closing `});` — JS syntax error prevented entire IIFE from loading | Added missing `});` before `} else {` |
---
## Feature Reference (Stage 7)
### SSH Burn-In Architecture
`ssh_client.py` provides an optional SSH execution layer. When `SSH_HOST` is set (and key or password is present), all burn-in stages run real commands over SSH against TrueNAS. When `SSH_HOST` is empty, stages fall back to mock/REST simulation.
**Dual-mode dispatch** — each stage checks `ssh_client.is_configured()`:
```python
if ssh_client.is_configured():
    ...  # run smartctl / badblocks over SSH
else:
    ...  # simulate with REST API or timed sleep (mock mode)
```
**SSH client capabilities** (`ssh_client.py`):
- `test_connection()` → `{"ok": bool, "error": str}` — used by Test SSH button
- `get_smart_attributes(devname)` → parse `smartctl -a`, return `{health, raw_output, attributes, warnings, failures}`
- `start_smart_test(devname, test_type)` → `smartctl -t short|long /dev/{devname}`
- `poll_smart_progress(devname)` → `smartctl -a` during test; returns `{state, percent_remaining, output}`
- `abort_smart_test(devname)` → `smartctl -X /dev/{devname}`
- `run_badblocks(devname, on_progress, cancelled_fn)` → streams `badblocks -wsv -b 4096 -p 1`; counts bad sectors from stdout (digit-only lines)
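The `poll_smart_progress` parsing step can be sketched like this; the sample text mimics typical `smartctl -a` self-test status output, but real formatting varies by drive and firmware, and the return shape here is illustrative:

```python
import re

def parse_smart_progress(output: str) -> dict:
    """Extract self-test progress from smartctl -a text (illustrative)."""
    m = re.search(r"(\d+)% of test remaining", output)
    if m:
        remaining = int(m.group(1))
        return {"state": "running", "percent": 100 - remaining}
    if "completed without error" in output.lower():
        return {"state": "success", "percent": 100}
    return {"state": "unknown", "percent": 0}

sample = """Self-test execution status:      ( 249) Self-test routine in progress...
                                        90% of test remaining."""
```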
**Key auth pattern** — key is stored as PEM string in settings, never written to disk:
```python
asyncssh.connect(host, ..., client_keys=[asyncssh.import_private_key(pem_str)], known_hosts=None)
```
**badblocks streaming** — uses `asyncssh.create_process()` with parallel stdout/stderr draining via `asyncio.gather`. Progress updates written to DB every 20 lines to avoid excessive writes.
### SMART Attribute Monitoring
Monitored attributes and their thresholds:
| ID | Attribute | Non-zero raw value → |
|----|------|----------------|
| 5 | Reallocated_Sector_Ct | FAIL |
| 10 | Spin_Retry_Count | WARN |
| 188 | Command_Timeout | WARN |
| 197 | Current_Pending_Sector | FAIL |
| 198 | Offline_Uncorrectable | FAIL |
| 199 | UDMA_CRC_Error_Count | WARN |
SMART attrs stored as JSON blob in `drives.smart_attrs`. Updated by `final_check` stage (SSH mode) or `short_smart`/`long_smart` REST mode. Displayed in drive drawer with colour-coded table + raw `smartctl -a` output.
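The threshold table reduces to a small classifier; the IDs and severities are from the table above, while the function shape is a sketch, not the actual `ssh_client.py` code:

```python
FAIL_IDS = {5, 197, 198}   # Reallocated, Current_Pending, Offline_Uncorrectable
WARN_IDS = {10, 188, 199}  # Spin_Retry, Command_Timeout, UDMA_CRC_Error

def classify_attrs(attrs: dict[int, int]) -> dict[str, list[int]]:
    """attrs maps SMART attribute ID -> raw value."""
    failures = sorted(i for i in FAIL_IDS if attrs.get(i, 0) > 0)
    warnings = sorted(i for i in WARN_IDS if attrs.get(i, 0) > 0)
    return {"failures": failures, "warnings": warnings}
```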
### Drive Reset Action
- `POST /api/v1/drives/{drive_id}/reset` — clears `smart_tests` rows to idle, clears `drives.smart_attrs`, writes audit event, notifies SSE subscribers
- Button appears in action column when `can_reset` = drive has no active burn-in AND has any non-idle smart state or smart attrs
- Burn-in history (burnin_jobs, burnin_stages) is preserved — reset only affects SMART test state
### New Routes (Stage 7)
| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/api/v1/drives/{id}/reset` | Reset SMART state and attrs for a drive |
| `POST` | `/api/v1/settings/test-ssh` | Test SSH connection with current SSH settings |
| `GET` | `/api/v1/updates/check` | Check for latest release from Forgejo git.hellocomputer.xyz |
### Check for Updates
Settings page has a "Check for Updates" button that fetches:
```
GET https://git.hellocomputer.xyz/api/v1/repos/brandon/truenas-burnin/releases/latest
```
Compares tag name against `settings.app_version`; shows "up to date" or "v{tag} available".
### Version Badge
`app_version` set as Jinja2 global in `renderer.py`:
```python
templates.env.globals["app_version"] = _settings.app_version
```
Displayed in header as `<span class="header-version">v{app_version}</span>` (right side, muted).
### Configurable Thresholds
`renderer.py` `_temp_class` now reads from settings instead of hardcoded values:
```python
if temp >= settings.temp_crit_c: return "temp-crit"
if temp >= settings.temp_warn_c: return "temp-warn"
```
`precheck` stage fails if `temperature_c >= settings.temp_crit_c`.
Surface validate fails if `bad_blocks > settings.bad_block_threshold` (default 0 = any bad sector = fail).
## Feature Reference (Stage 8)
### Live Terminal
A full PTY SSH terminal embedded in the log drawer as a fourth tab ("Terminal"). Requires SSH to be configured in Settings.
**Architecture:**
```
Browser (xterm.js) ──WS binary──▶ /ws/terminal (FastAPI WebSocket)
                                      │ terminal.py handle()
                                      ▼
                                  asyncssh.connect() → create_process(term_type="xterm-256color")
                                      │
                                      ▼
                                  asyncio tasks: ssh_to_ws() + ws_to_ssh()
```
**Message protocol** (client ↔ server):
- Client → server **binary**: raw keyboard input bytes forwarded to SSH stdin
- Client → server **text**: JSON control message — only `{"type":"resize","cols":N,"rows":N}` used currently
- Server → client **binary**: raw terminal output bytes from SSH stdout
**`app/terminal.py`** — `handle(ws)`:
1. Guard: `ssh_host` must be set; key or password must be present
2. `asyncssh.connect(known_hosts=None)` with key loaded via `import_private_key()` (never written to disk)
3. `conn.create_process(term_type="xterm-256color", term_size=(80,24), encoding=None)` — opens shell PTY
4. Two asyncio tasks bridging the streams; `asyncio.wait(FIRST_COMPLETED)` + cancel pending on disconnect
5. ANSI-formatted status messages for connect/error states
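Step 4's wait-then-cancel shape, with stand-in coroutines in place of the real `ssh_to_ws()`/`ws_to_ssh()` (a sketch of the pattern, not `terminal.py` itself):

```python
import asyncio

async def bridge(pump_a, pump_b):
    """Run both pumps; when either side closes, cancel the survivor."""
    tasks = {asyncio.ensure_future(pump_a), asyncio.ensure_future(pump_b)}
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for t in pending:
        t.cancel()
    # Reap the cancelled task so nothing leaks a pending CancelledError.
    await asyncio.gather(*pending, return_exceptions=True)
    return len(done), len(pending)
```

This is what makes a browser disconnect tear down the SSH side promptly (and vice versa) instead of leaving a half-open bridge.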
**Frontend (app.js):**
- xterm.js 5.3.0 + xterm-addon-fit 0.8.0 loaded **lazily** on first Terminal tab click (CDN, ~300KB — not loaded until needed)
- `_termInit()` creates Terminal + FitAddon, opens into the panel div, registers `onData` once
- `ResizeObserver` on the panel → `fit()` + sends `resize` JSON to server
- `_termConnect()` called on init and by Reconnect button — guards against double-connect with `readyState <= 1` check
- `onData` always writes to current `_termWs` by reference — multiple reconnects don't add duplicate handlers
- Reconnect bar floats over terminal on `ws.onclose`; removed on `ws.onopen`
**Tab lifecycle:**
- Terminal tab click → `openTerminalTab()`: loads libs → `_termInit()` → `_termConnect()` on first open; just refits on subsequent opens
- Autoscroll label hidden when terminal tab is active (not applicable)
- WebSocket stays alive when drawer closes — shell persists until page unload or explicit disconnect
**New route:**
| Method | Path | Description |
|--------|------|-------------|
| `WS` | `/ws/terminal` | asyncssh PTY bridge |
**Config used:** `ssh_host`, `ssh_port`, `ssh_user`, `ssh_key`, `ssh_password` — same SSH settings as burn-in stages.
**xterm.js theme:** GitHub Dark color palette (matches app dark theme). `scrollback: 2000`. Font: SF Mono / Fira Code / Consolas.
### Cutting to Real TrueNAS (Next Steps)
When ready to test against a real TrueNAS CORE box:
1. In Settings (or `.env`), set:
- **TrueNAS URL** → `https://10.0.0.X` (real IP)
- **API Key** → real API key
- **SSH Host** → same IP as TrueNAS
- **SSH User** → `root` (or sudoer with smartctl/badblocks access)
- **SSH Key** → paste PEM key into textarea
2. Click **Test SSH Connection** to verify before starting a burn-in
3. TrueNAS CORE uses `ada0`, `da0` device names (not `sda`). Mock drive names will differ.
4. Delete `app.db` before first real poll to clear mock drive rows
5. Comment out `mock-truenas` service in `docker-compose.yml` (optional — harmless to leave)
6. Verify TrueNAS CORE v2.0 REST API:
- `GET /api/v2.0/disk` returns list with `name`, `serial`, `model`, `size`, `temperature`
- `GET /api/v2.0/core/get_jobs` with filter `[["method","=","smart.test"]]`
- `POST /api/v2.0/smart/test` accepts `{disks: [devname], type: "SHORT"|"LONG"}`
---
## Feature Reference (Stage 6b)
### New Pages
| URL | Description |
|-----|-------------|
| `/stats` | Analytics — pass rate by model, daily activity last 14 days |
| `/audit` | Audit log — last 200 events with drive/operator context |
| `/settings` | Editable 2-col settings form (SMTP, Notifications, Behavior, Webhook) |
| `/history/{id}/print` | Print-friendly job report with QR code |
### New API Routes (6b + 6c)
| Method | Path | Description |
|--------|------|-------------|
| `PATCH` | `/api/v1/drives/{id}` | Update `notes` and/or `location` |
| `POST` | `/api/v1/settings` | Save runtime settings to `/data/settings_overrides.json` |
| `POST` | `/api/v1/settings/test-smtp` | Test SMTP connection without sending email |
### Notifications
- **Browser push**: Bell icon in header → `Notification.requestPermission()`. Fires on `job-alert` SSE event (burnin pass/fail).
- **SSE alert event**: `job-alert` event type on `/sse/drives`. JS listens via `htmx:sseMessage`.
- **Immediate email**: `send_job_alert()` in mailer.py. Triggered by `notifier.notify_job_complete()` from burnin.py.
- **Webhook**: `notifier._send_webhook()` — POST JSON to `WEBHOOK_URL`. Payload includes event, job_id, devname, serial, model, state, operator, error_text.
### Stuck Job Detection
- `burnin.check_stuck_jobs()` runs every 5 poll cycles (~1 min)
- Jobs running longer than `STUCK_JOB_HOURS` (default 24h) → state=unknown
- Logged at CRITICAL level; audit event written
### Batch Burn-In
- Checkboxes on each idle/selectable drive row
- Batch bar appears in filter row when any drives selected
- Uses existing `POST /api/v1/burnin/start` with multiple `drive_ids`
- Requires operator name + explicit confirmation checkbox (no serial required)
- JS `checkedDriveIds` Set persists across SSE swaps via `restoreCheckboxes()`
### Drive Location
- `location` and `notes` fields added to drives table via ALTER TABLE migration
- Inline click-to-edit on location field in drive name cell
- Saves via `PATCH /api/v1/drives/{id}` on blur/Enter; restores on Escape
## Feature Reference (Stage 6c)
### Settings Page
- Two-column layout: SMTP card (left, wider) + Notifications / Behavior / Webhook stacked (right)
- Read-only system card at bottom (TrueNAS URL, poll interval, etc.) — restart required badge
- All changes save instantly via `POST /api/v1/settings` → `settings_store.save()` → `/data/settings_overrides.json`
- Overrides loaded on startup in `main.py` lifespan via `settings_store.init()`
- Connection mode dropdown auto-sets port: STARTTLS→587, SSL/TLS→465, Plain→25
- Test Connection button at top of SMTP card — tests live settings without sending email
- Brand logo in header is now a clickable `<a href="/">` home link
### SMTP Port Derivation
```python
# mailer.py — port is derived from mode, NOT from settings.smtp_port
_MODE_PORTS = {"starttls": 587, "ssl": 465, "plain": 25}
port = _MODE_PORTS.get(mode, 587)
```
Never use `settings.smtp_port` in mailer — it's kept in config for `.env` backward compat only.
### Burn-In Stage Selection
`StartBurninRequest` no longer takes `profile: str`. Instead takes:
- `run_surface: bool = True` — surface validate (destructive write test)
- `run_short: bool = True` — Short SMART (non-destructive)
- `run_long: bool = True` — Long SMART (non-destructive)
Profile string is computed as a property. Profiles: `full`, `surface_short`, `surface_long`,
`surface`, `short_long`, `short`, `long`. Precheck and final_check always run.
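The boolean-to-profile mapping can be sketched as follows (inferred from the profile names listed above; the actual `models.py` property may differ in detail):

```python
def profile(run_surface: bool, run_short: bool, run_long: bool) -> str:
    """Compute the profile string from the three stage checkboxes."""
    parts = []
    if run_surface:
        parts.append("surface")
    if run_short:
        parts.append("short")
    if run_long:
        parts.append("long")
    if parts == ["surface", "short", "long"]:
        return "full"       # all three stages = the original full profile
    return "_".join(parts)
```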
`STAGE_ORDER` in `burnin.py` has all 7 profile combinations.
`_recalculate_progress()` uses `_STAGE_BASE_WEIGHTS` dict (per-stage weights) and computes
overall % dynamically from actual `burnin_stages` rows — no profile lookup needed.
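The weighted-percent computation can be sketched like this; the weight values here are illustrative, not the real `_STAGE_BASE_WEIGHTS` numbers:

```python
_STAGE_BASE_WEIGHTS = {
    "precheck": 1, "short_smart": 5, "long_smart": 30,
    "surface_validate": 60, "io_validate": 10, "final_check": 1,
}

def overall_percent(stages: list[tuple[str, float]]) -> float:
    """stages: (stage_name, stage_percent 0-100) from burnin_stages rows.
    Total weight is summed from the stages actually present, so no
    profile lookup is needed."""
    total = sum(_STAGE_BASE_WEIGHTS[name] for name, _ in stages)
    done = sum(_STAGE_BASE_WEIGHTS[name] * pct / 100 for name, pct in stages)
    return round(100 * done / total, 1) if total else 0.0
```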
In the UI, both single-drive and batch modals show 3 checkboxes. If surface is unchecked:
- Destructive warning is hidden
- Serial confirmation field is hidden (single modal)
- Confirmation checkbox is hidden (batch modal)
### Table Scroll Fix
```css
.table-wrap {
  max-height: calc(100vh - 205px); /* header(44) + main-pad(20) + stats-bar(70) + filter-bar(46) + buffer */
}
```
If stats bar or other content height changes, update this offset.
## Feature Reference (Stage 6d)
### Cancel Functionality
| What | How |
|------|-----|
| Cancel running Short SMART | `✕ Short` button appears in action col when `short_busy`; calls `POST /api/v1/drives/{id}/smart/cancel` with `{type:"short"}` |
| Cancel running Long SMART | `✕ Long` button appears when `long_busy`; same route with `{type:"long"}` |
| Cancel individual burn-in | `✕ Burn-In` button (was "Cancel") shown when `bi_active`; calls `POST /api/v1/burnin/{id}/cancel` |
| Cancel All Running | Red `✕ Cancel All Burn-Ins` button appears in filter bar when any burn-in jobs are active; JS collects all `.btn-cancel[data-job-id]` and cancels each |
**SMART cancel route** (`POST /api/v1/drives/{drive_id}/smart/cancel`):
1. Fetches all running TrueNAS jobs via `client.get_smart_jobs()`
2. Finds job where `arguments[0].disks` contains the drive's devname
3. Calls `client.abort_job(tn_job_id)`
4. Updates `smart_tests` table row to `state='aborted'`
### Stage Reordering
- Default order changed to: **Short SMART → Long SMART → Surface Validate** (non-destructive first)
- Drag handles (⠿) on each stage row in both single and batch modals
- HTML5 drag-and-drop, no external library
- `getStageOrder(listId)` reads current DOM order of checked stages
- `stage_order: ["short_smart","long_smart","surface_validate"]` sent in API body
- `StartBurninRequest.stage_order: list[str] | None` — validated against allowed stage names
- `burnin.start_job()` accepts `stage_order` param; builds: `["precheck"] + stage_order + ["final_check"]`
- `_run_job()` reads stage names back from `burnin_stages ORDER BY id` — so custom order is honoured
- Destructive warning / serial confirmation still triggered by `stage-surface` checkbox ID (order-independent)
## NPM / DNS Setup
- Proxy host: `burnin.hellocomputer.xyz` → `http://10.0.0.138:8080`
- Authelia protection: recommended (no built-in auth in app)
- DNS: `burnin.hellocomputer.xyz` CNAME → `sandon.hellocomputer.xyz` (proxied: false)