nas-burnin/CLAUDE.md
Brandon Walter 8ae84862de
infra: rename truenas-burnin → nas-burnin (1.0.0-41)
Matches the 1.0.0-38 product display rename. Touches every
infrastructure identifier:

- container_name: truenas-burnin → nas-burnin
- forge URL in /api/v1/updates/check
- security-scan: REPO_URL, REPO, DEPLOY_DIR, systemd unit description
- run-tests.sh default container name
- doc paths in README/SPEC/CLAUDE
- in-app instruction strings (login.html, settings.html, auth_cli.py)

Maple migration done in lockstep:
  docker compose down (truenas-burnin)
  mv ~/docker/stacks/{truenas-burnin,nas-burnin}
  systemd unit ExecStart updated + daemon-reload
  docker compose up -d --build → container nas-burnin
  Old image truenas-burnin-app removed (~12 GB reclaimed)
  Stale top-level orphans cleaned (config.py, poller.py, routes.py,
  truenas.py, tests/) — all dead since pre-split refactors

Forge repo rename (git.hellocomputer.xyz/brandon/truenas-burnin →
nas-burnin) is a separate UI-only step. Forgejo redirects the old
URL after rename, so this commit can be pushed to the existing
remote first; remote URL gets updated locally once you rename.
2026-05-04 07:16:02 -07:00

# NAS Burn-In Dashboard — Project Context
> Drop this file in any new Claude session to resume work with full context.
> Last updated: 2026-05-03 (v1.0.0-39 — live against TrueNAS SCALE 25.10)
---
## What This Is
A self-hosted web dashboard for running and tracking hard-drive burn-in tests
against a TrueNAS SCALE 25.10 instance. Deployed on **maple.local** (10.0.0.138).
- **App URL**: http://10.0.0.138:8084 (or http://burnin.hellocomputer.xyz)
- **Stack path on maple.local**: `~/docker/stacks/nas-burnin/`
- **Source (local mac)**: `~/Desktop/claudesandbox/nas-burnin/`
- **Compose synced to maple.local** via `scp` or manual copy
### Stages completed
| Stage | Description | Status |
|-------|-------------|--------|
| 1 | Mock TrueNAS CORE v2.0 API (15 drives, sda–sdo) | ✅ |
| 2 | Backend core (FastAPI, SQLite/WAL, poller, TrueNAS client) | ✅ |
| 3 | Dashboard UI (Jinja2, SSE live updates, dark theme) | ✅ |
| 4 | Burn-in orchestrator (queue, concurrency, start/cancel) | ✅ |
| 5 | History page, job detail page, CSV export | ✅ |
| 6 | Hardening (retries, JSON logging, IP allowlist, poller watchdog) | ✅ |
| 6b | UX overhaul (stats bar, alerts, batch, notifications, location, print, analytics) | ✅ |
| 6c | Settings overhaul (editable form, runtime store, SMTP fix, stage selection) | ✅ |
| 6d | Cancel SMART tests, Cancel All burn-ins, drag-to-reorder stages in modals | ✅ |
| 7 | SSH burn-in execution, SMART attr monitoring, drive reset, version badge, stats polish | ✅ |
| 8 | Live against TrueNAS SCALE 25.10: SSH SMART, disk temps, CPU/PCH sensors, thermal gate | ✅ |
---
## File Map
```
nas-burnin/
├── docker-compose.yml  # two services: mock-truenas + app
├── Dockerfile  # app container
├── requirements.txt
├── .env.example
├── data/  # SQLite DB lives here (gitignored, created on deploy)
├── mock-truenas/
│   ├── Dockerfile
│   └── app.py  # FastAPI mock of TrueNAS CORE v2.0 REST API
└── app/
    ├── __init__.py
    ├── config.py  # pydantic-settings; reads .env
    ├── database.py  # schema, migrations, init_db(), get_db()
    ├── models.py  # Pydantic v2 models; StartBurninRequest has run_surface/run_short/run_long + profile property
    ├── settings_store.py  # runtime settings store; persists to /data/settings_overrides.json
    ├── ssh_client.py  # asyncssh client: smartctl parsing, badblocks streaming, sensors, test_connection
    ├── truenas.py  # httpx async client with retry (lambda factory pattern)
    ├── poller.py  # poll loop, SSE pub/sub, stale detection, stuck-job check
    ├── burnin.py  # orchestrator, semaphore, stages, check_stuck_jobs()
    ├── notifier.py  # webhook + immediate email alerts on job completion
    ├── mailer.py  # daily HTML email + per-job alert email
    ├── logging_config.py  # structured JSON logging
    ├── renderer.py  # Jinja2 + filters (format_bytes, format_eta, format_elapsed, …)
    ├── routes.py  # all FastAPI route handlers
    ├── main.py  # app factory, IP allowlist middleware, lifespan
    ├── static/
    │   ├── app.css  # full dark theme + mobile responsive
    │   └── app.js  # push notifications, batch, elapsed timers, inline edit
    └── templates/
        ├── layout.html  # header nav: History, Stats, Audit, Settings, bell button
        ├── dashboard.html  # stats bar (+ CPU/PCH sensors, thermal chip), failed banner, batch bar, log drawer (3 tabs: Burn-In/SMART/Events)
        ├── history.html
        ├── job_detail.html  # + Print/Export button
        ├── audit.html  # audit event log
        ├── stats.html  # analytics: pass rate by model, daily activity, duration by size, failures by stage
        ├── settings.html  # editable 2-col form: SMTP + SSH (left) + Notifications/Behavior/Webhook/System (right)
        ├── job_print.html  # print view with client-side QR code (qrcodejs CDN)
        └── components/
            ├── drives_table.html  # checkboxes, elapsed time, location inline edit
            ├── modal_start.html  # single-drive burn-in modal
            └── modal_batch.html  # batch burn-in modal
```
---
## Architecture Overview
```
Browser ──HTMX SSE──▶ GET /sse/drives
                      poller.subscribe()
                      asyncio.Queue ◀─── poller.run() notifies after each poll
                           │               & after each burnin stage update
                      render drives_table.html
                      yield SSE "drives-update" event
```
- **Poller** (`poller.py`): runs every `POLL_INTERVAL_SECONDS` (default 12s), calls
TrueNAS `/api/v2.0/disk` and `/api/v2.0/core/get_jobs`, writes to SQLite,
notifies SSE subscribers
- **Burn-in** (`burnin.py`): `asyncio.Semaphore(max_parallel_burnins)` gates
concurrency. Jobs are created immediately (queued state), semaphore gates
actual execution. On startup, any interrupted running jobs → state=unknown;
queued jobs are re-enqueued.
- **SSE** (`routes.py /sse/drives`): one persistent connection per browser tab.
Renders fresh `drives_table.html` HTML fragment on every notification.
- **HTMX** (`dashboard.html`): `hx-ext="sse"` + `sse-swap="drives-update"`
replaces `#drives-tbody` content without page reload.
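
The semaphore gating described for `burnin.py` can be sketched roughly as follows. This is a minimal illustration, not the project's code: `run_job`, the `log` list, and the 10 ms sleep standing in for the real stages are all hypothetical.

```python
import asyncio

MAX_PARALLEL_BURNINS = 2  # mirrors the documented default


async def run_job(job_id, gate, log):
    # The job exists (queued) as soon as it is created ...
    log.append(f"queued:{job_id}")
    # ... but execution waits here until the semaphore has a free slot.
    async with gate:
        log.append(f"running:{job_id}")
        await asyncio.sleep(0.01)  # hypothetical stand-in for the stages
        log.append(f"done:{job_id}")


async def main():
    gate = asyncio.Semaphore(MAX_PARALLEL_BURNINS)
    log = []
    await asyncio.gather(*(run_job(i, gate, log) for i in range(4)))
    return log


log = asyncio.run(main())
```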
---
## Database Schema (SQLite WAL mode)
```sql
-- drives: upsert by truenas_disk_id (the TrueNAS internal disk identifier)
drives (id, truenas_disk_id UNIQUE, devname, serial, model, size_bytes,
temperature_c, smart_health, last_polled_at)
-- smart_tests: one row per drive+test_type combination (UNIQUE constraint)
smart_tests (id, drive_id FK, test_type CHECK('short','long'),
state, percent, started_at, eta_at, finished_at, error_text,
UNIQUE(drive_id, test_type))
-- burnin_jobs: one row per burn-in run (multiple per drive over time)
burnin_jobs (id, drive_id FK, profile, state CHECK(queued/running/passed/
failed/cancelled/unknown), percent, stage_name, operator,
created_at, started_at, finished_at, error_text)
-- burnin_stages: one row per stage per job
burnin_stages (id, burnin_job_id FK, stage_name, state, percent,
started_at, finished_at, error_text,
log_text TEXT, -- raw smartctl/badblocks SSH output
bad_blocks INTEGER) -- bad sector count from surface_validate
-- audit_events: append-only log
audit_events (id, event_type, drive_id, job_id, operator, note, created_at)
-- drives columns added by migrations:
-- location TEXT, notes TEXT (Stage 6b)
-- smart_attrs TEXT -- JSON blob of last SMART attribute snapshot (Stage 7)
-- smart_tests columns added by migrations:
-- raw_output TEXT -- raw smartctl -a output (Stage 7)
```
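A minimal sketch of the "upsert by `truenas_disk_id`" behaviour, using the standard-library `sqlite3` module and a trimmed-down `drives` table. The exact SQL the poller runs is not shown in this file, so the `UPSERT` statement and the sample disk id are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE drives (
        id INTEGER PRIMARY KEY,
        truenas_disk_id TEXT UNIQUE,
        devname TEXT,
        temperature_c INTEGER
    )
    """
)

# Assumed statement: insert a new row, or refresh the existing row when the
# TrueNAS disk identifier is already known (requires SQLite 3.24 or newer).
UPSERT = """
    INSERT INTO drives (truenas_disk_id, devname, temperature_c)
    VALUES (?, ?, ?)
    ON CONFLICT(truenas_disk_id) DO UPDATE SET
        devname = excluded.devname,
        temperature_c = excluded.temperature_c
"""

conn.execute(UPSERT, ("disk-ABC123", "sda", 31))  # first poll
conn.execute(UPSERT, ("disk-ABC123", "sdb", 35))  # same disk, new devname
rows = conn.execute("SELECT id, devname, temperature_c FROM drives").fetchall()
```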
---
## Burn-In Stage Definitions
```python
STAGE_ORDER = {
    "quick": ["precheck", "short_smart", "io_validate", "final_check"],
    "full": ["precheck", "surface_validate", "short_smart", "long_smart", "final_check"],
}
```
The UI only exposes **full** profile (destructive). Quick profile exists for dev/testing.
---
## TrueNAS API Contracts Used
| Method | Endpoint | Notes |
|--------|----------|-------|
| GET | `/api/v2.0/disk` | List all disks |
| POST | `/api/v2.0/smart/test` | Start SMART test `{disks:[name], type:"SHORT"\|"LONG"}` |
| GET | `/api/v2.0/core/get_jobs` | Filter `[["method","=","smart.test"]]` |
| POST | `/api/v2.0/core/job_abort` | `job_id` positional arg |
| GET | `/api/v2.0/smart/test/results/{disk}` | Per-disk SMART results |
Auth: `Authorization: Bearer {TRUENAS_API_KEY}` header.
---
## Config / Environment Variables
All read from `.env` via `pydantic-settings`. See `.env.example` for full list.
| Variable | Default | Notes |
|----------|---------|-------|
| `APP_HOST` | `0.0.0.0` | |
| `APP_PORT` | `8080` | |
| `DB_PATH` | `/data/app.db` | Inside container |
| `TRUENAS_BASE_URL` | `http://localhost:8000` | Point at mock or real TrueNAS |
| `TRUENAS_API_KEY` | `mock-key` | Real API key for prod |
| `TRUENAS_VERIFY_TLS` | `false` | Set true for prod with valid cert |
| `POLL_INTERVAL_SECONDS` | `12` | |
| `STALE_THRESHOLD_SECONDS` | `45` | UI shows warning if data older than this |
| `MAX_PARALLEL_BURNINS` | `2` | asyncio.Semaphore limit |
| `SURFACE_VALIDATE_SECONDS` | `45` | Mock only — duration of surface stage |
| `IO_VALIDATE_SECONDS` | `25` | Mock only — duration of I/O stage |
| `STUCK_JOB_HOURS` | `24` | Hours before a running job is auto-marked unknown |
| `LOG_LEVEL` | `INFO` | |
| `ALLOWED_IPS` | `` | Empty = allow all. Comma-sep IPs/CIDRs |
| `SMTP_HOST` | `` | Empty = email disabled |
| `SMTP_PORT` | `587` | |
| `SMTP_USER` | `` | |
| `SMTP_PASSWORD` | `` | |
| `SMTP_FROM` | `` | |
| `SMTP_TO` | `` | Comma-separated |
| `SMTP_REPORT_HOUR` | `8` | Local hour (0-23) to send daily report |
| `SMTP_ALERT_ON_FAIL` | `true` | Immediate email when a job fails |
| `SMTP_ALERT_ON_PASS` | `false` | Immediate email when a job passes |
| `WEBHOOK_URL` | `` | POST JSON on burnin_passed/burnin_failed. Works with ntfy, Slack, Discord, n8n |
| `TEMP_WARN_C` | `46` | Temperature warning threshold (°C) |
| `TEMP_CRIT_C` | `55` | Temperature critical threshold — precheck fails above this |
| `BAD_BLOCK_THRESHOLD` | `0` | Max bad blocks allowed before surface_validate fails (0 = any bad = fail) |
| `APP_VERSION` | `1.0.0-9` | Displayed in header version badge |
| `SSH_HOST` | `` | TrueNAS SSH hostname/IP — empty disables SSH mode (uses mock/REST) |
| `SSH_PORT` | `22` | TrueNAS SSH port |
| `SSH_USER` | `root` | TrueNAS SSH username |
| `SSH_PASSWORD` | `` | TrueNAS SSH password (use key instead for production) |
| `SSH_KEY` | `` | TrueNAS SSH private key PEM string — loaded in-memory, never written to disk |
---
## Deploy Workflow
### First deploy (already done)
```bash
# On maple.local
cd ~/docker/stacks/nas-burnin
docker compose up -d --build
```
### Redeploy after code changes
```bash
# Copy changed files from mac to maple.local first, e.g.:
scp -P 2225 -r app/ brandon@10.0.0.138:~/docker/stacks/nas-burnin/
# Then on maple.local:
ssh brandon@10.0.0.138 -p 2225
cd ~/docker/stacks/nas-burnin
docker compose up -d --build
```
### Reset the database (e.g. after schema changes)
```bash
# On maple.local — stop containers first
docker compose stop app
# Delete DB using alpine (container owns the file, sudo not available)
docker run --rm -v ~/docker/stacks/nas-burnin/data:/data alpine rm -f /data/app.db
docker compose start app
```
### Check logs
```bash
docker compose logs -f app
docker compose logs -f mock-truenas
```
---
## Mock TrueNAS Server (`mock-truenas/app.py`)
- 15 drives: `sda`–`sdo`
- Drive mix: 3× ST12000NM0008 12TB, 3× WD80EFAX 8TB, 2× ST16000NM001G 16TB,
2× ST4000VN008 4TB, 2× TOSHIBA MG06ACA10TE 10TB, 1× HGST HUS728T8TAL5200 8TB,
1× Seagate Barracuda ST6000DM003 6TB, 1× **FAIL001** (sdn) — always fails at ~30%
- SHORT test: 90s simulated; LONG test: 480s simulated; tick every 5s
- Debug endpoints:
- `POST /debug/reset` — reset all jobs/state
- `GET /debug/state` — dump current state
- `POST /debug/complete-all-jobs` — instantly complete all running tests
---
## Key Implementation Patterns
### Retry pattern — lambda factory (NOT coroutine object)
```python
# CORRECT: pass a factory so each retry creates a fresh coroutine
r = await _with_retry(lambda: self._client.get("/api/v2.0/disk"), "get_disks")
# WRONG: coroutine is exhausted after first await, retry silently fails
r = await _with_retry(self._client.get("/api/v2.0/disk"), "get_disks")
```
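For reference, a factory-based retry helper might look like the sketch below. The real `_with_retry` is not reproduced in this file, so the attempt count, the delay, and the `flaky` coroutine used for demonstration are assumptions.

```python
import asyncio


async def with_retry(factory, label, attempts=3, delay=0.0):
    """Call factory() to get a fresh awaitable on every attempt."""
    last_exc = None
    for attempt in range(1, attempts + 1):
        try:
            return await factory()
        except Exception as exc:  # sketch only: narrow this in real code
            last_exc = exc
            if attempt < attempts:
                await asyncio.sleep(delay)
    raise RuntimeError(f"{label} failed after {attempts} attempts") from last_exc


calls = {"n": 0}


async def flaky():
    # Hypothetical request that fails twice, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"


result = asyncio.run(with_retry(lambda: flaky(), "get_disks"))
```

Because the helper receives a callable, each attempt awaits a new coroutine object; passing `flaky()` directly would hand it a coroutine that can only be awaited once.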
### SSE template rendering
```python
# Use templates.env.get_template().render() — not TemplateResponse (that's a Response object)
html = templates.env.get_template("components/drives_table.html").render(drives=drives)
yield {"event": "drives-update", "data": html}
```
### Sticky thead scroll fix
```css
/* BOTH axes required on table-wrap for position:sticky to work on thead */
.table-wrap {
  overflow: auto; /* NOT overflow-x: auto */
  max-height: calc(100vh - 205px); /* was 130px before the Stage 6b stats bar */
}
thead { position: sticky; top: 0; z-index: 10; }
```
### Burn-in SMART column overlay
```python
# When a burn-in runs a short_smart or long_smart stage, its progress must be
# mirrored in the Short/Long SMART columns (which normally read from smart_tests table).
# _fetch_drives_for_template() queries burnin_stages for running/completed SMART stages
# and overlays them onto the drive dict. Only overlays if standalone SMART column is idle.
# Helper: _compute_eta_seconds(started_at, percent) for linear ETA extrapolation.
```
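A linear ETA extrapolation of the kind `_compute_eta_seconds(started_at, percent)` is described as doing could look like this. It is a sketch: the optional `now` argument and the handling of 0% and 100% are assumptions added to keep the example deterministic.

```python
from datetime import datetime, timedelta, timezone


def compute_eta_seconds(started_at, percent, now=None):
    """Estimate remaining seconds assuming progress is linear in time."""
    if percent <= 0:
        return None  # no progress yet, nothing to extrapolate from
    if percent >= 100:
        return 0
    now = now or datetime.now(timezone.utc)
    elapsed = (now - started_at).total_seconds()
    if elapsed <= 0:
        return None
    # elapsed covers `percent` of the work; scale to the remaining share.
    return int(elapsed * (100 - percent) / percent)


start = datetime(2026, 1, 1, tzinfo=timezone.utc)
eta = compute_eta_seconds(start, 25, now=start + timedelta(seconds=600))
```

Here 600 seconds bought 25% of the work, so the remaining 75% is estimated at 1800 seconds.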
### export.csv route ordering
```python
# MUST register export.csv BEFORE /{job_id} — FastAPI tries int() on "export.csv"
@router.get("/api/v1/burnin/export.csv") # first
async def burnin_export_csv(...): ...
@router.get("/api/v1/burnin/{job_id}") # second
async def burnin_get(job_id: int, ...): ...
```
---
## Known Issues / Past Bugs Fixed
| Bug | Root Cause | Fix |
|-----|-----------|-----|
| `_execute_stages` used `STAGE_ORDER[profile]` ignoring custom order | Stage order stored in DB but not read back | `_run_job` reads stages from `burnin_stages ORDER BY id`; `_execute_stages` accepts `stages: list[str]` |
| Poller stuck at 'running' after completion | `_sync_history()` had early-return guard when state=running | Removed guard — `_sync_history` only called when job not in active dict |
| DB schema tables missing after edit | Tables split into separate variable never passed to `executescript()` | Put all tables in single `SCHEMA` string |
| Retry not retrying | `_with_retry(coro)` — coroutine exhausted after first fail | Changed to `_with_retry(factory: Callable[[], Coroutine])` |
| `error_text` overwritten | `_finish_stage(success=False)` overwrote error set by stage handler | `_finish_stage` omits `error_text` column in SQL when param is None |
| Cancelled stage showed 'failed' | `_execute_stages` called `_finish_stage(success=False)` on cancel | Check `_is_cancelled()`, call `_cancel_stage()` instead |
| export.csv returns 422 | Route registered after `/{job_id}`, FastAPI tries `int("export.csv")` | Move export route before parameterized route |
| Old drive names persist after mock rename | Poller upserts by `truenas_disk_id`, old rows stay | Delete `app.db` and restart |
| First row clipped behind sticky thead | `overflow-x: auto` only creates partial stacking context | Use `overflow: auto` (both axes) on `.table-wrap` |
| `rm data/app.db` permission denied | Container owns the file | Use `docker run --rm -v .../data:/data alpine rm -f /data/app.db` |
| First row clipped after Stage 6b | Stats bar added 70px but max-height not updated | `max-height: calc(100vh - 205px)` |
| SMTP "Connection unexpectedly closed" | `_send_email` used `settings.smtp_port` (587 default) even in SSL mode | Derive port from mode via `_MODE_PORTS` dict; SSL→465, STARTTLS→587, Plain→25 |
| SSL mode missing EHLO | `smtplib.SMTP_SSL` was created without calling `ehlo()` | Added `server.ehlo()` after both SSL and STARTTLS connections |
| `profile` NameError in `_execute_stages` | `_execute_stages` called `_recalculate_progress(job_id, profile)` but `profile` not in scope | Changed to `_recalculate_progress(job_id)` — profile param was unused |
| `app_version` Jinja2 global rendered as function | Set `templates.env.globals["app_version"] = _get_app_version` (callable) | Set to the static string value directly: `= _settings.app_version` |
| All buttons broken (Short/Long/Burn-In/Cancel) | `stages.forEach(function(s){` in `_drawerRenderBurnin` missing closing `});` — JS syntax error prevented entire IIFE from loading | Added missing `});` before `} else {` |
| Burn-in SMART stage shows in wrong column | Burn-in orchestrator tracks SMART progress in `burnin_stages` table, but SMART columns read from `smart_tests` table only | `_fetch_drives_for_template` now queries `burnin_stages` for active burn-ins and overlays SMART stage progress/results onto the Short/Long SMART columns |
| 14TB surface jobs marked `failed` after 6-day clean run (1.0.0-10) | `_stage_final_check` treated `ssh_client.get_smart_attributes` failures as drive failures, but that helper swallows transport errors and returns `failures: ["SSH error: ..."]`. A 1-second SSH blip invalidated multi-day surface scans. | `_stage_final_check` now distinguishes pure SSH-only failures (every entry starts with `"SSH error:"`) from real SMART failures; retries 3× with 30s gaps; soft-passes on persistent SSH-only — surface stages stand. |
| `database is locked` during long_smart (1.0.0-11) | `_stage_smart_test_ssh` appended full smartctl output to `log_text` every 5s poll. SQLite's `COALESCE(log_text,'')||?` rewrites the whole column, and over 6+ hours `log_text` grew to 50 MB → contention against poller/orchestrator/settings writers. | (a) `_db()` is now an `@asynccontextmanager` setting `PRAGMA busy_timeout=10000` per connection. (b) log_text appends throttled to every 12 polls (~60s) or on state change. |
| Stuck stage rows linger as `running` after `check_stuck_jobs` (1.0.0-11) | Stuck-job detector updated `burnin_jobs.state='unknown'` but didn't touch stage rows. | Added `UPDATE burnin_stages SET state='unknown', finished_at=? WHERE burnin_job_id=? AND state='running'` to the same transaction. |
| Dashboard 500 — `TypeError: unhashable type: 'dict'` from Jinja (1.0.0-12) | Starlette 1.0.0 (released 2026-04) removed the legacy `TemplateResponse(name, context)` signature. With the old call style, the context dict ended up where `name` was expected, → Jinja `cache_key` was unhashable. | Migrated all 7 calls to new signature: `TemplateResponse(request, name, context)`. **Root enabler**: `requirements.txt` is unpinned, so `--build` pulled the latest breaking release. |
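
The SSH-only versus real-failure distinction from the 1.0.0-10 fix can be sketched as a small classifier. `classify_failures` and its return labels are hypothetical; only the `"SSH error:"` prefix convention comes from the table above.

```python
def classify_failures(failures):
    """Classify the `failures` list returned by a SMART attribute check."""
    if not failures:
        return "pass"
    # Every entry is a transport problem: candidate for retry / soft-pass.
    if all(entry.startswith("SSH error:") for entry in failures):
        return "ssh_only"
    # At least one entry is a genuine SMART failure.
    return "smart_failure"
```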
---
## Operational Gotchas
### `requirements.txt` is unpinned
Every `docker compose up -d --build` pulls latest of fastapi, starlette, jinja2, asyncssh, etc. The Starlette 1.0 regression on 2026-04-27 is a direct consequence. **Either pin to known-good versions, or audit installed versions immediately after each rebuild** with:
```bash
docker exec nas-burnin python3 -c "import fastapi, starlette, jinja2; print(fastapi.__version__, starlette.__version__, jinja2.__version__)"
```
### Local source ↔ maple host can drift
The deploy convention is `scp -r app/` from mac to maple, but if you ever edit on maple directly (or skip an `scp` after local changes), the two trees diverge. As of 2026-04-27 the local `routes.py` had unsynced SMART-overlay work but was missing the deployed `/ws/terminal` Stage 8 endpoint — neither side a superset.
**Always `diff -u` before bulk scp:**
```bash
ssh -p 2225 brandon@10.0.0.138 'cat ~/docker/stacks/nas-burnin/app/routes.py' > /tmp/deployed_routes.py
diff -u /tmp/deployed_routes.py ~/Desktop/claudesandbox/nas-burnin/app/routes.py
```
When sides have conflicting edits, prefer **patching the host file in place + rebuild** over a destructive scp.
---
## Feature Reference (Stage 7)
### SSH Burn-In Architecture
`ssh_client.py` provides an optional SSH execution layer. When `SSH_HOST` is set (and key or password is present), all burn-in stages run real commands over SSH against TrueNAS. When `SSH_HOST` is empty, stages fall back to mock/REST simulation.
**Dual-mode dispatch** — each stage checks `ssh_client.is_configured()`:
```python
if ssh_client.is_configured():
    ...  # run smartctl / badblocks over SSH
else:
    ...  # simulate with REST API or timed sleep (mock mode)
```
**SSH client capabilities** (`ssh_client.py`):
- `test_connection()` → `{"ok": bool, "error": str}` — used by Test SSH button
- `get_smart_attributes(devname)` → parse `smartctl -a`, return `{health, raw_output, attributes, warnings, failures}`
- `start_smart_test(devname, test_type)` → `smartctl -t short|long /dev/{devname}`
- `poll_smart_progress(devname)` → `smartctl -a` during test; returns `{state, percent_remaining, output}`
- `abort_smart_test(devname)` → `smartctl -X /dev/{devname}`
- `run_badblocks(devname, on_progress, cancelled_fn)` → streams `badblocks -wsv -b 4096 -p 1`; counts bad sectors from stdout (digit-only lines)
**Key auth pattern** — key is stored as PEM string in settings, never written to disk:
```python
asyncssh.connect(host, ..., client_keys=[asyncssh.import_private_key(pem_str)], known_hosts=None)
```
**badblocks streaming** — uses `asyncssh.create_process()` with parallel stdout/stderr draining via `asyncio.gather`. Progress updates written to DB every 20 lines to avoid excessive writes.
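
The counting and throttling rules above (digit-only stdout lines are bad sectors, progress is written every 20 lines) can be illustrated without SSH. This is a sketch over an in-memory list of lines; `consume_badblocks_lines` and the `on_progress(lines_seen, bad_count)` callback shape are assumptions.

```python
def consume_badblocks_lines(lines, on_progress, every=20):
    """Count digit-only lines and report progress every `every` lines."""
    bad_blocks = 0
    for lines_seen, line in enumerate(lines, start=1):
        if line.strip().isdigit():
            bad_blocks += 1  # a bare block number means a bad sector
        if lines_seen % every == 0:
            on_progress(lines_seen, bad_blocks)  # throttled DB write
    return bad_blocks


progress = []
sample = ["Testing with pattern 0xaa:", "1024", "2048"] + ["done"] * 38
total_bad = consume_badblocks_lines(
    sample, lambda seen, bad: progress.append((seen, bad))
)
```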
### SMART Attribute Monitoring
Monitored attributes and their thresholds:
| ID | Name | Any non-zero → |
|----|------|----------------|
| 5 | Reallocated_Sector_Ct | FAIL |
| 10 | Spin_Retry_Count | WARN |
| 188 | Command_Timeout | WARN |
| 197 | Current_Pending_Sector | FAIL |
| 198 | Offline_Uncorrectable | FAIL |
| 199 | UDMA_CRC_Error_Count | WARN |
SMART attrs stored as JSON blob in `drives.smart_attrs`. Updated by `final_check` stage (SSH mode) or `short_smart`/`long_smart` REST mode. Displayed in drive drawer with colour-coded table + raw `smartctl -a` output.
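
The threshold table can be expressed as a lookup plus a small evaluator. This is a sketch of the rule "any non-zero raw value triggers WARN or FAIL"; the `MONITORED` dict, `evaluate_attributes`, and the `name=value` message format are hypothetical.

```python
# Attribute ID -> (name, severity when the raw value is non-zero)
MONITORED = {
    5: ("Reallocated_Sector_Ct", "FAIL"),
    10: ("Spin_Retry_Count", "WARN"),
    188: ("Command_Timeout", "WARN"),
    197: ("Current_Pending_Sector", "FAIL"),
    198: ("Offline_Uncorrectable", "FAIL"),
    199: ("UDMA_CRC_Error_Count", "WARN"),
}


def evaluate_attributes(raw_values):
    """raw_values maps SMART attribute ID to its raw integer value."""
    warnings, failures = [], []
    for attr_id, (name, severity) in MONITORED.items():
        value = raw_values.get(attr_id, 0)
        if value == 0:
            continue
        bucket = failures if severity == "FAIL" else warnings
        bucket.append(f"{name}={value}")
    return {"warnings": warnings, "failures": failures}


report = evaluate_attributes({5: 0, 197: 8, 199: 3})
```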
### Drive Reset Action
- `POST /api/v1/drives/{drive_id}/reset` — clears `smart_tests` rows to idle, clears `drives.smart_attrs`, writes audit event, notifies SSE subscribers
- Button appears in action column when `can_reset` = drive has no active burn-in AND has any non-idle smart state or smart attrs
- Burn-in history (burnin_jobs, burnin_stages) is preserved — reset only affects SMART test state
### New Routes (Stage 7)
| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/api/v1/drives/{id}/reset` | Reset SMART state and attrs for a drive |
| `POST` | `/api/v1/settings/test-ssh` | Test SSH connection with current SSH settings |
| `GET` | `/api/v1/updates/check` | Check for latest release from Forgejo git.hellocomputer.xyz |
### Check for Updates
Settings page has a "Check for Updates" button that fetches:
```
GET https://git.hellocomputer.xyz/api/v1/repos/brandon/nas-burnin/releases/latest
```
Compares tag name against `settings.app_version`; shows "up to date" or "v{tag} available".
### Version Badge
`app_version` set as Jinja2 global in `renderer.py`:
```python
templates.env.globals["app_version"] = _settings.app_version
```
Displayed in header as `<span class="header-version">v{app_version}</span>` (right side, muted).
### Configurable Thresholds
`renderer.py` `_temp_class` now reads from settings instead of hardcoded values:
```python
if temp >= settings.temp_crit_c: return "temp-crit"
if temp >= settings.temp_warn_c: return "temp-warn"
```
`precheck` stage fails if `temperature_c >= settings.temp_crit_c`.
Surface validate fails if `bad_blocks > settings.bad_block_threshold` (default 0 = any bad sector = fail).
## Feature Reference (Stage 8)
### Live Terminal
A full PTY SSH terminal embedded in the log drawer as a fourth tab ("Terminal"). Requires SSH to be configured in Settings.
**Architecture:**
```
Browser (xterm.js) ──WS binary──▶ /ws/terminal (FastAPI WebSocket)
                                  terminal.py handle()
                                  asyncssh.connect() → create_process(term_type="xterm-256color")
                                  asyncio tasks: ssh_to_ws() + ws_to_ssh()
```
**Message protocol** (client ↔ server):
- Client → server **binary**: raw keyboard input bytes forwarded to SSH stdin
- Client → server **text**: JSON control message — only `{"type":"resize","cols":N,"rows":N}` used currently
- Server → client **binary**: raw terminal output bytes from SSH stdout
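
On the server side, the text-frame half of this protocol reduces to parsing one JSON shape. The sketch below validates a `resize` message and ignores anything else; `parse_resize` is a hypothetical helper, not a function from `terminal.py`.

```python
import json


def parse_resize(text):
    """Return (cols, rows) for a valid resize control message, else None."""
    try:
        msg = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(msg, dict) or msg.get("type") != "resize":
        return None
    cols, rows = msg.get("cols"), msg.get("rows")
    if isinstance(cols, int) and isinstance(rows, int) and cols > 0 and rows > 0:
        return cols, rows
    return None
```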
**`app/terminal.py`** — `handle(ws)`:
1. Guard: `ssh_host` must be set; key or password must be present
2. `asyncssh.connect(known_hosts=None)` with key loaded via `import_private_key()` (never written to disk)
3. `conn.create_process(term_type="xterm-256color", term_size=(80,24), encoding=None)` — opens shell PTY
4. Two asyncio tasks bridging the streams; `asyncio.wait(FIRST_COMPLETED)` + cancel pending on disconnect
5. ANSI-formatted status messages for connect/error states
**Frontend (app.js):**
- xterm.js 5.3.0 + xterm-addon-fit 0.8.0 loaded **lazily** on first Terminal tab click (CDN, ~300KB — not loaded until needed)
- `_termInit()` creates Terminal + FitAddon, opens into the panel div, registers `onData` once
- `ResizeObserver` on the panel → `fit()` + sends `resize` JSON to server
- `_termConnect()` called on init and by Reconnect button — guards against double-connect with `readyState <= 1` check
- `onData` always writes to current `_termWs` by reference — multiple reconnects don't add duplicate handlers
- Reconnect bar floats over terminal on `ws.onclose`; removed on `ws.onopen`
**Tab lifecycle:**
- Terminal tab click → `openTerminalTab()`: loads libs → `_termInit()` → `_termConnect()` on first open; just refits on subsequent opens
- Autoscroll label hidden when terminal tab is active (not applicable)
- WebSocket stays alive when drawer closes — shell persists until page unload or explicit disconnect
**New route:**
| Method | Path | Description |
|--------|------|-------------|
| `WS` | `/ws/terminal` | asyncssh PTY bridge |
**Config used:** `ssh_host`, `ssh_port`, `ssh_user`, `ssh_key`, `ssh_password` — same SSH settings as burn-in stages.
**xterm.js theme:** GitHub Dark color palette (matches app dark theme). `scrollback: 2000`. Font: SF Mono / Fira Code / Consolas.
### Cutting to Real TrueNAS (Next Steps)
When ready to test against a real TrueNAS CORE box:
1. In Settings (or `.env`), set:
- **TrueNAS URL** → `https://10.0.0.X` (real IP)
- **API Key** → real API key
- **SSH Host** → same IP as TrueNAS
- **SSH User** → `root` (or sudoer with smartctl/badblocks access)
- **SSH Key** → paste PEM key into textarea
2. Click **Test SSH Connection** to verify before starting a burn-in
3. TrueNAS CORE uses `ada0`, `da0` device names (not `sda`). Mock drive names will differ.
4. Delete `app.db` before first real poll to clear mock drive rows
5. Comment out `mock-truenas` service in `docker-compose.yml` (optional — harmless to leave)
6. Verify TrueNAS CORE v2.0 REST API:
- `GET /api/v2.0/disk` returns list with `name`, `serial`, `model`, `size`, `temperature`
- `GET /api/v2.0/core/get_jobs` with filter `[["method","=","smart.test"]]`
- `POST /api/v2.0/smart/test` accepts `{disks: [devname], type: "SHORT"|"LONG"}`
---
## Feature Reference (Stage 6b)
### New Pages
| URL | Description |
|-----|-------------|
| `/stats` | Analytics — pass rate by model, daily activity last 14 days |
| `/audit` | Audit log — last 200 events with drive/operator context |
| `/settings` | Editable 2-col settings form (SMTP, Notifications, Behavior, Webhook) |
| `/history/{id}/print` | Print-friendly job report with QR code |
### New API Routes (6b + 6c)
| Method | Path | Description |
|--------|------|-------------|
| `PATCH` | `/api/v1/drives/{id}` | Update `notes` and/or `location` |
| `POST` | `/api/v1/settings` | Save runtime settings to `/data/settings_overrides.json` |
| `POST` | `/api/v1/settings/test-smtp` | Test SMTP connection without sending email |
### Notifications
- **Browser push**: Bell icon in header → `Notification.requestPermission()`. Fires on `job-alert` SSE event (burnin pass/fail).
- **SSE alert event**: `job-alert` event type on `/sse/drives`. JS listens via `htmx:sseMessage`.
- **Immediate email**: `send_job_alert()` in mailer.py. Triggered by `notifier.notify_job_complete()` from burnin.py.
- **Webhook**: `notifier._send_webhook()` — POST JSON to `WEBHOOK_URL`. Payload includes event, job_id, devname, serial, model, state, operator, error_text.
### Stuck Job Detection
- `burnin.check_stuck_jobs()` runs every 5 poll cycles (~1 min)
- Jobs running longer than `STUCK_JOB_HOURS` (default 24h) → state=unknown
- Logged at CRITICAL level; audit event written
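
The two timing rules above can be sketched as pure functions. `should_check` and `is_stuck` are hypothetical names; the real `check_stuck_jobs()` also updates the database and writes an audit event, which this sketch leaves out.

```python
from datetime import datetime, timedelta, timezone


def should_check(poll_cycle, every=5):
    # With a 12 s poll interval, every 5th cycle is roughly once a minute.
    return poll_cycle % every == 0


def is_stuck(started_at, now, stuck_job_hours=24):
    return now - started_at > timedelta(hours=stuck_job_hours)


start = datetime(2026, 5, 1, 8, 0, tzinfo=timezone.utc)
```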
### Batch Burn-In
- Checkboxes on each idle/selectable drive row
- Batch bar appears in filter row when any drives selected
- Uses existing `POST /api/v1/burnin/start` with multiple `drive_ids`
- Requires operator name + explicit confirmation checkbox (no serial required)
- JS `checkedDriveIds` Set persists across SSE swaps via `restoreCheckboxes()`
### Drive Location
- `location` and `notes` fields added to drives table via ALTER TABLE migration
- Inline click-to-edit on location field in drive name cell
- Saves via `PATCH /api/v1/drives/{id}` on blur/Enter; restores on Escape
## Feature Reference (Stage 6c)
### Settings Page
- Two-column layout: SMTP card (left, wider) + Notifications / Behavior / Webhook stacked (right)
- Read-only system card at bottom (TrueNAS URL, poll interval, etc.) — restart required badge
- All changes save instantly via `POST /api/v1/settings` → `settings_store.save()` → `/data/settings_overrides.json`
- Overrides loaded on startup in `main.py` lifespan via `settings_store.init()`
- Connection mode dropdown auto-sets port: STARTTLS→587, SSL/TLS→465, Plain→25
- Test Connection button at top of SMTP card — tests live settings without sending email
- Brand logo in header is now a clickable `<a href="/">` home link
### SMTP Port Derivation
```python
# mailer.py — port is derived from mode, NOT from settings.smtp_port
_MODE_PORTS = {"starttls": 587, "ssl": 465, "plain": 25}
port = _MODE_PORTS.get(mode, 587)
```
Never use `settings.smtp_port` in mailer — it's kept in config for `.env` backward compat only.
### Burn-In Stage Selection
`StartBurninRequest` no longer takes `profile: str`. Instead takes:
- `run_surface: bool = True` — surface validate (destructive write test)
- `run_short: bool = True` — Short SMART (non-destructive)
- `run_long: bool = True` — Long SMART (non-destructive)
Profile string is computed as a property. Profiles: `full`, `surface_short`, `surface_long`,
`surface`, `short_long`, `short`, `long`. Precheck and final_check always run.
`STAGE_ORDER` in `burnin.py` has all 7 profile combinations.
`_recalculate_progress()` uses `_STAGE_BASE_WEIGHTS` dict (per-stage weights) and computes
overall % dynamically from actual `burnin_stages` rows — no profile lookup needed.
In the UI, both single-drive and batch modals show 3 checkboxes. If surface is unchecked:
- Destructive warning is hidden
- Serial confirmation field is hidden (single modal)
- Confirmation checkbox is hidden (batch modal)
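
How the three booleans map onto the seven profile names can be sketched with a plain dataclass. The real `StartBurninRequest` is a Pydantic model; `StartBurninSketch` and the `ValueError` for an empty selection are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class StartBurninSketch:
    run_surface: bool = True
    run_short: bool = True
    run_long: bool = True

    @property
    def profile(self):
        if self.run_surface and self.run_short and self.run_long:
            return "full"
        selected = [
            name
            for name, enabled in (
                ("surface", self.run_surface),
                ("short", self.run_short),
                ("long", self.run_long),
            )
            if enabled
        ]
        if not selected:
            # Assumed behaviour: the source does not say what happens here.
            raise ValueError("at least one stage must be selected")
        return "_".join(selected)
```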
### Table Scroll Fix
```css
.table-wrap {
  max-height: calc(100vh - 205px); /* header(44) + main-pad(20) + stats-bar(70) + filter-bar(46) + buffer */
}
```
If stats bar or other content height changes, update this offset.
## Feature Reference (Stage 6d)
### Cancel Functionality
| What | How |
|------|-----|
| Cancel running Short SMART | `✕ Short` button appears in action col when `short_busy`; calls `POST /api/v1/drives/{id}/smart/cancel` with `{type:"short"}` |
| Cancel running Long SMART | `✕ Long` button appears when `long_busy`; same route with `{type:"long"}` |
| Cancel individual burn-in | `✕ Burn-In` button (was "Cancel") shown when `bi_active`; calls `POST /api/v1/burnin/{id}/cancel` |
| Cancel All Running | Red `✕ Cancel All Burn-Ins` button appears in filter bar when any burn-in jobs are active; JS collects all `.btn-cancel[data-job-id]` and cancels each |
**SMART cancel route** (`POST /api/v1/drives/{drive_id}/smart/cancel`):
1. Fetches all running TrueNAS jobs via `client.get_smart_jobs()`
2. Finds job where `arguments[0].disks` contains the drive's devname
3. Calls `client.abort_job(tn_job_id)`
4. Updates `smart_tests` table row to `state='aborted'`
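
Step 2 of the cancel route, matching a TrueNAS job to a drive, can be sketched as below. `find_smart_job` and the sample job dicts are hypothetical; only the `arguments[0].disks` lookup comes from the steps above.

```python
def find_smart_job(jobs, devname):
    """Return the id of the first job whose first argument lists devname."""
    for job in jobs:
        args = job.get("arguments") or []
        if args and isinstance(args[0], dict) and devname in args[0].get("disks", []):
            return job.get("id")
    return None


sample_jobs = [
    {"id": 41, "arguments": [{"disks": ["sda"], "type": "SHORT"}]},
    {"id": 42, "arguments": [{"disks": ["sdc", "sdd"], "type": "LONG"}]},
    {"id": 43, "arguments": []},
]
```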
### Stage Reordering
- Default order changed to: **Short SMART → Long SMART → Surface Validate** (non-destructive first)
- Drag handles (⠿) on each stage row in both single and batch modals
- HTML5 drag-and-drop, no external library
- `getStageOrder(listId)` reads current DOM order of checked stages
- `stage_order: ["short_smart","long_smart","surface_validate"]` sent in API body
- `StartBurninRequest.stage_order: list[str] | None` — validated against allowed stage names
- `burnin.start_job()` accepts `stage_order` param; builds: `["precheck"] + stage_order + ["final_check"]`
- `_run_job()` reads stage names back from `burnin_stages ORDER BY id` — so custom order is honoured
- Destructive warning / serial confirmation still triggered by `stage-surface` checkbox ID (order-independent)
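
The stage-list construction described above can be sketched as follows. `build_stage_list` is a hypothetical helper; the duplicate check is an assumption, while the allowed names, the default order, and the `precheck`/`final_check` wrapping come from the list above.

```python
ALLOWED_STAGES = ("short_smart", "long_smart", "surface_validate")


def build_stage_list(stage_order=None):
    # Default: non-destructive stages first.
    order = list(ALLOWED_STAGES) if stage_order is None else list(stage_order)
    unknown = [stage for stage in order if stage not in ALLOWED_STAGES]
    if unknown:
        raise ValueError(f"unknown stage(s): {unknown}")
    if len(set(order)) != len(order):
        raise ValueError("duplicate stage in stage_order")
    return ["precheck"] + order + ["final_check"]
```

Because `_run_job()` reads stage names back from `burnin_stages ORDER BY id`, inserting the rows in this list order is what preserves a custom order at run time.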
## NPM / DNS Setup
- Proxy host: `burnin.hellocomputer.xyz` → `http://10.0.0.138:8080`
- Authelia protection: recommended (no built-in auth in app)
- DNS: `burnin.hellocomputer.xyz` CNAME → `sandon.hellocomputer.xyz` (proxied: false)