Brandon Walter b73b5251ae Initial commit — TrueNAS Burn-In Dashboard v0.5.0
Full-stack burn-in orchestration dashboard (Stages 1–6d complete):
FastAPI backend, SQLite/WAL, SSE live dashboard, mock TrueNAS server,
SMTP/webhook notifications, batch burn-in, settings UI, audit log,
stats page, cancel SMART/burn-in, drag-to-reorder stages.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-24 00:08:29 -05:00


# TrueNAS Burn-In Dashboard — Project Context
> Drop this file in any new Claude session to resume work with full context.
> Last updated: 2026-02-22 (Stage 6d)
---
## What This Is
A self-hosted web dashboard for running and tracking hard-drive burn-in tests
against a TrueNAS CORE instance. Deployed on **maple.local** (10.0.0.138).
- **App URL**: http://10.0.0.138:8084 (or http://burnin.hellocomputer.xyz)
- **Stack path on maple.local**: `~/docker/stacks/truenas-burnin/`
- **Source (local mac)**: `~/Desktop/claude-sandbox/truenas-burnin/`
- **Compose synced to maple.local** via `scp` or manual copy
### Stages completed
| Stage | Description | Status |
|-------|-------------|--------|
| 1 | Mock TrueNAS CORE v2.0 API (15 drives, `sda`–`sdo`) | ✅ |
| 2 | Backend core (FastAPI, SQLite/WAL, poller, TrueNAS client) | ✅ |
| 3 | Dashboard UI (Jinja2, SSE live updates, dark theme) | ✅ |
| 4 | Burn-in orchestrator (queue, concurrency, start/cancel) | ✅ |
| 5 | History page, job detail page, CSV export | ✅ |
| 6 | Hardening (retries, JSON logging, IP allowlist, poller watchdog) | ✅ |
| 6b | UX overhaul (stats bar, alerts, batch, notifications, location, print, analytics) | ✅ |
| 6c | Settings overhaul (editable form, runtime store, SMTP fix, stage selection) | ✅ |
| 6d | Cancel SMART tests, Cancel All burn-ins, drag-to-reorder stages in modals | ✅ |
| 7 | Cut to real TrueNAS | 🔲 future |
---
## File Map
```
truenas-burnin/
├── docker-compose.yml     # two services: mock-truenas + app
├── Dockerfile             # app container
├── requirements.txt
├── .env.example
├── data/                  # SQLite DB lives here (gitignored, created on deploy)
├── mock-truenas/
│   ├── Dockerfile
│   └── app.py             # FastAPI mock of TrueNAS CORE v2.0 REST API
└── app/
    ├── __init__.py
    ├── config.py          # pydantic-settings; reads .env
    ├── database.py        # schema, migrations, init_db(), get_db()
    ├── models.py          # Pydantic v2 models; StartBurninRequest has run_surface/run_short/run_long + profile property
    ├── settings_store.py  # runtime settings store — persists to /data/settings_overrides.json
    ├── truenas.py         # httpx async client with retry (lambda factory pattern)
    ├── poller.py          # poll loop, SSE pub/sub, stale detection, stuck-job check
    ├── burnin.py          # orchestrator, semaphore, stages, check_stuck_jobs()
    ├── notifier.py        # webhook + immediate email alerts on job completion
    ├── mailer.py          # daily HTML email + per-job alert email
    ├── logging_config.py  # structured JSON logging
    ├── renderer.py        # Jinja2 + filters (format_bytes, format_eta, format_elapsed, …)
    ├── routes.py          # all FastAPI route handlers
    ├── main.py            # app factory, IP allowlist middleware, lifespan
    ├── static/
    │   ├── app.css        # full dark theme + mobile responsive
    │   └── app.js         # push notifications, batch, elapsed timers, inline edit
    └── templates/
        ├── layout.html    # header nav: History, Stats, Audit, Settings, bell button
        ├── dashboard.html # stats bar, failed banner, batch bar
        ├── history.html
        ├── job_detail.html # + Print/Export button
        ├── audit.html      # audit event log
        ├── stats.html      # analytics: pass rate by model, daily activity
        ├── settings.html   # editable 2-col form: SMTP (left) + Notifications/Behavior/Webhook (right)
        ├── job_print.html  # print view with client-side QR code (qrcodejs CDN)
        └── components/
            ├── drives_table.html # checkboxes, elapsed time, location inline edit
            ├── modal_start.html  # single-drive burn-in modal
            └── modal_batch.html  # batch burn-in modal
```
---
## Architecture Overview
```
Browser ──HTMX SSE──▶ GET /sse/drives
                        │ poller.subscribe()
                        ▼
                  asyncio.Queue ◀── poller.run() notifies after each poll
                        │            & after each burn-in stage update
                        ▼
                  render drives_table.html
                  yield SSE "drives-update" event
```
- **Poller** (`poller.py`): runs every `POLL_INTERVAL_SECONDS` (default 12s), calls
TrueNAS `/api/v2.0/disk` and `/api/v2.0/core/get_jobs`, writes to SQLite,
notifies SSE subscribers
- **Burn-in** (`burnin.py`): `asyncio.Semaphore(max_parallel_burnins)` gates
concurrency. Jobs are created immediately (queued state), semaphore gates
actual execution. On startup, any interrupted running jobs → state=unknown;
queued jobs are re-enqueued.
- **SSE** (`routes.py /sse/drives`): one persistent connection per browser tab.
Renders fresh `drives_table.html` HTML fragment on every notification.
- **HTMX** (`dashboard.html`): `hx-ext="sse"` + `sse-swap="drives-update"`
replaces `#drives-tbody` content without page reload.
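The subscribe/notify fan-out above can be sketched as follows (illustrative only — the names `subscribe`/`unsubscribe`/`notify_all` are assumptions, not necessarily the actual `poller.py` API):

```python
import asyncio

# Each SSE connection gets its own queue; the poller fans out to all of them.
_subscribers: set[asyncio.Queue] = set()

def subscribe() -> asyncio.Queue:
    """Register a new SSE subscriber (one per browser tab)."""
    q: asyncio.Queue = asyncio.Queue()
    _subscribers.add(q)
    return q

def unsubscribe(q: asyncio.Queue) -> None:
    _subscribers.discard(q)

def notify_all(event: str = "drives-update") -> None:
    """Called after each poll cycle and after each burn-in stage update."""
    for q in _subscribers:
        q.put_nowait(event)

async def demo() -> str:
    q = subscribe()
    notify_all()             # poller finished a cycle
    event = await q.get()    # SSE generator wakes up, re-renders the table
    unsubscribe(q)
    return event
```

The SSE route's generator loops on `await q.get()` and yields a freshly rendered table fragment each time it wakes.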
---
## Database Schema (SQLite WAL mode)
```sql
-- drives: upsert by truenas_disk_id (the TrueNAS internal disk identifier)
drives (id, truenas_disk_id UNIQUE, devname, serial, model, size_bytes,
        temperature_c, smart_health, last_polled_at)

-- smart_tests: one row per drive+test_type combination (UNIQUE constraint)
smart_tests (id, drive_id FK, test_type CHECK('short','long'),
             state, percent, started_at, eta_at, finished_at, error_text,
             UNIQUE(drive_id, test_type))

-- burnin_jobs: one row per burn-in run (multiple per drive over time)
burnin_jobs (id, drive_id FK, profile, state CHECK(queued/running/passed/
             failed/cancelled/unknown), percent, stage_name, operator,
             created_at, started_at, finished_at, error_text)

-- burnin_stages: one row per stage per job
burnin_stages (id, burnin_job_id FK, stage_name, state, percent,
               started_at, finished_at, error_text)

-- audit_events: append-only log
audit_events (id, event_type, drive_id, job_id, operator, note, created_at)
```
---
## Burn-In Stage Definitions
```python
STAGE_ORDER = {
    "quick": ["precheck", "short_smart", "io_validate", "final_check"],
    "full":  ["precheck", "surface_validate", "short_smart", "long_smart", "final_check"],
}
```
The UI originally exposed only the **full** profile (destructive); since Stage 6c the UI offers per-stage selection instead (see Feature Reference 6c, which expands `STAGE_ORDER` to all 7 profile combinations). Quick profile exists for dev/testing.
---
## TrueNAS API Contracts Used
| Method | Endpoint | Notes |
|--------|----------|-------|
| GET | `/api/v2.0/disk` | List all disks |
| POST | `/api/v2.0/smart/test` | Start SMART test `{disks:[name], type:"SHORT"\|"LONG"}` |
| GET | `/api/v2.0/core/get_jobs` | Filter `[["method","=","smart.test"]]` |
| POST | `/api/v2.0/core/job_abort` | `job_id` positional arg |
| GET | `/api/v2.0/smart/test/results/{disk}` | Per-disk SMART results |
Auth: `Authorization: Bearer {TRUENAS_API_KEY}` header.
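For reference, the shape of a SMART-test start call per the contract above (a sketch — the helper name `smart_test_request` is illustrative, not from `truenas.py`):

```python
def smart_test_request(devname: str, test_type: str, api_key: str) -> tuple[str, dict, dict]:
    """Build (path, headers, body) for POST /api/v2.0/smart/test."""
    assert test_type in ("SHORT", "LONG")
    path = "/api/v2.0/smart/test"
    headers = {"Authorization": f"Bearer {api_key}"}      # bearer auth on every call
    body = {"disks": [devname], "type": test_type}
    return path, headers, body
```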
---
## Config / Environment Variables
All read from `.env` via `pydantic-settings`. See `.env.example` for full list.
| Variable | Default | Notes |
|----------|---------|-------|
| `APP_HOST` | `0.0.0.0` | |
| `APP_PORT` | `8080` | |
| `DB_PATH` | `/data/app.db` | Inside container |
| `TRUENAS_BASE_URL` | `http://localhost:8000` | Point at mock or real TrueNAS |
| `TRUENAS_API_KEY` | `mock-key` | Real API key for prod |
| `TRUENAS_VERIFY_TLS` | `false` | Set true for prod with valid cert |
| `POLL_INTERVAL_SECONDS` | `12` | |
| `STALE_THRESHOLD_SECONDS` | `45` | UI shows warning if data older than this |
| `MAX_PARALLEL_BURNINS` | `2` | asyncio.Semaphore limit |
| `SURFACE_VALIDATE_SECONDS` | `45` | Mock only — duration of surface stage |
| `IO_VALIDATE_SECONDS` | `25` | Mock only — duration of I/O stage |
| `STUCK_JOB_HOURS` | `24` | Hours before a running job is auto-marked unknown |
| `LOG_LEVEL` | `INFO` | |
| `ALLOWED_IPS` | `` | Empty = allow all. Comma-sep IPs/CIDRs |
| `SMTP_HOST` | `` | Empty = email disabled |
| `SMTP_PORT` | `587` | |
| `SMTP_USER` | `` | |
| `SMTP_PASSWORD` | `` | |
| `SMTP_FROM` | `` | |
| `SMTP_TO` | `` | Comma-separated |
| `SMTP_REPORT_HOUR` | `8` | Local hour (0-23) to send daily report |
| `SMTP_ALERT_ON_FAIL` | `true` | Immediate email when a job fails |
| `SMTP_ALERT_ON_PASS` | `false` | Immediate email when a job passes |
| `WEBHOOK_URL` | `` | POST JSON on burnin_passed/burnin_failed. Works with ntfy, Slack, Discord, n8n |
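The `ALLOWED_IPS` semantics (empty = allow all, comma-separated IPs/CIDRs otherwise) might be implemented along these lines — a sketch using stdlib `ipaddress`; function names are illustrative, not from `main.py`:

```python
import ipaddress

def parse_allowlist(raw: str) -> list:
    """Parse comma-separated IPs/CIDRs. Empty string = allow all (empty list)."""
    nets = []
    for item in raw.split(","):
        item = item.strip()
        if item:
            # strict=False lets a bare IP like 192.168.1.5 parse as a /32
            nets.append(ipaddress.ip_network(item, strict=False))
    return nets

def ip_allowed(client_ip: str, allowlist: list) -> bool:
    if not allowlist:          # empty allowlist = allow everyone
        return True
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in allowlist)
```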
---
## Deploy Workflow
### First deploy (already done)
```bash
# On maple.local
cd ~/docker/stacks/truenas-burnin
docker compose up -d --build
```
### Redeploy after code changes
```bash
# Copy changed files from mac to maple.local first, e.g.:
scp -P 2225 -r app/ brandon@10.0.0.138:~/docker/stacks/truenas-burnin/
# Then on maple.local:
ssh brandon@10.0.0.138 -p 2225
cd ~/docker/stacks/truenas-burnin
docker compose up -d --build
```
### Reset the database (e.g. after schema changes)
```bash
# On maple.local — stop containers first
docker compose stop app
# Delete DB using alpine (container owns the file, sudo not available)
docker run --rm -v ~/docker/stacks/truenas-burnin/data:/data alpine rm -f /data/app.db
docker compose start app
```
### Check logs
```bash
docker compose logs -f app
docker compose logs -f mock-truenas
```
---
## Mock TrueNAS Server (`mock-truenas/app.py`)
- 15 drives: `sda`–`sdo`
- Drive mix: 3× ST12000NM0008 12TB, 3× WD80EFAX 8TB, 2× ST16000NM001G 16TB,
2× ST4000VN008 4TB, 2× TOSHIBA MG06ACA10TE 10TB, 1× HGST HUS728T8TAL5200 8TB,
1× Seagate Barracuda ST6000DM003 6TB, 1× **FAIL001** (sdn) — always fails at ~30%
- SHORT test: 90s simulated; LONG test: 480s simulated; tick every 5s
- Debug endpoints:
- `POST /debug/reset` — reset all jobs/state
- `GET /debug/state` — dump current state
- `POST /debug/complete-all-jobs` — instantly complete all running tests
---
## Key Implementation Patterns
### Retry pattern — lambda factory (NOT coroutine object)
```python
# CORRECT: pass a factory so each retry creates a fresh coroutine
r = await _with_retry(lambda: self._client.get("/api/v2.0/disk"), "get_disks")
# WRONG: coroutine is exhausted after first await, retry silently fails
r = await _with_retry(self._client.get("/api/v2.0/disk"), "get_disks")
```
### SSE template rendering
```python
# Use templates.env.get_template().render() — not TemplateResponse (that's a Response object)
html = templates.env.get_template("components/drives_table.html").render(drives=drives)
yield {"event": "drives-update", "data": html}
```
### Sticky thead scroll fix
```css
/* BOTH axes required on table-wrap for position:sticky to work on thead */
.table-wrap {
  overflow: auto;                    /* NOT overflow-x: auto */
  max-height: calc(100vh - 130px);
}
thead { position: sticky; top: 0; z-index: 10; }
```
### export.csv route ordering
```python
# MUST register export.csv BEFORE /{job_id} — FastAPI tries int() on "export.csv"
@router.get("/api/v1/burnin/export.csv") # first
async def burnin_export_csv(...): ...
@router.get("/api/v1/burnin/{job_id}") # second
async def burnin_get(job_id: int, ...): ...
```
---
## Known Issues / Past Bugs Fixed
| Bug | Root Cause | Fix |
|-----|-----------|-----|
| `_execute_stages` used `STAGE_ORDER[profile]` ignoring custom order | Stage order stored in DB but not read back | `_run_job` reads stages from `burnin_stages ORDER BY id`; `_execute_stages` accepts `stages: list[str]` |
| Poller stuck at 'running' after completion | `_sync_history()` had early-return guard when state=running | Removed guard — `_sync_history` only called when job not in active dict |
| DB schema tables missing after edit | Tables split into separate variable never passed to `executescript()` | Put all tables in single `SCHEMA` string |
| Retry not retrying | `_with_retry(coro)` — coroutine exhausted after first fail | Changed to `_with_retry(factory: Callable[[], Coroutine])` |
| `error_text` overwritten | `_finish_stage(success=False)` overwrote error set by stage handler | `_finish_stage` omits `error_text` column in SQL when param is None |
| Cancelled stage showed 'failed' | `_execute_stages` called `_finish_stage(success=False)` on cancel | Check `_is_cancelled()`, call `_cancel_stage()` instead |
| export.csv returns 422 | Route registered after `/{job_id}`, FastAPI tries `int("export.csv")` | Move export route before parameterized route |
| Old drive names persist after mock rename | Poller upserts by `truenas_disk_id`, old rows stay | Delete `app.db` and restart |
| First row clipped behind sticky thead | `overflow-x: auto` only creates partial stacking context | Use `overflow: auto` (both axes) on `.table-wrap` |
| `rm data/app.db` permission denied | Container owns the file | Use `docker run --rm -v .../data:/data alpine rm -f /data/app.db` |
| First row clipped after Stage 6b | Stats bar added 70px but max-height not updated | `max-height: calc(100vh - 205px)` |
| SMTP "Connection unexpectedly closed" | `_send_email` used `settings.smtp_port` (587 default) even in SSL mode | Derive port from mode via `_MODE_PORTS` dict; SSL→465, STARTTLS→587, Plain→25 |
| SSL mode missing EHLO | `smtplib.SMTP_SSL` was created without calling `ehlo()` | Added `server.ehlo()` after both SSL and STARTTLS connections |
---
## Stage 7 — Cutting to Real TrueNAS (TODO)
When ready to test against a real TrueNAS CORE box:
1. In `.env` on maple.local, set:
   ```env
   TRUENAS_BASE_URL=https://10.0.0.203   # or whatever your TrueNAS IP is
   TRUENAS_API_KEY=your-real-key-here
   TRUENAS_VERIFY_TLS=false              # unless you have a valid cert
   ```
2. Comment out `mock-truenas` service in `docker-compose.yml` (or leave it running — harmless)
3. Verify TrueNAS CORE v2.0 API contract matches what `truenas.py` expects:
- `GET /api/v2.0/disk` returns list with `name`, `serial`, `model`, `size`, `temperature`
- `GET /api/v2.0/core/get_jobs` with filter `[["method","=","smart.test"]]`
- `POST /api/v2.0/smart/test` accepts `{disks: [devname], type: "SHORT"|"LONG"}`
4. Check that disk names match expected format (TrueNAS CORE uses `ada0`, `da0`, etc. — not `sda`)
- You may need to update mock drive names back or adjust poller logic
5. Delete `app.db` to clear mock drive rows before first real poll
---
## Feature Reference (Stage 6b)
### New Pages
| URL | Description |
|-----|-------------|
| `/stats` | Analytics — pass rate by model, daily activity last 14 days |
| `/audit` | Audit log — last 200 events with drive/operator context |
| `/settings` | Editable 2-col settings form (SMTP, Notifications, Behavior, Webhook) |
| `/history/{id}/print` | Print-friendly job report with QR code |
### New API Routes (6b + 6c)
| Method | Path | Description |
|--------|------|-------------|
| `PATCH` | `/api/v1/drives/{id}` | Update `notes` and/or `location` |
| `POST` | `/api/v1/settings` | Save runtime settings to `/data/settings_overrides.json` |
| `POST` | `/api/v1/settings/test-smtp` | Test SMTP connection without sending email |
### Notifications
- **Browser push**: Bell icon in header → `Notification.requestPermission()`. Fires on `job-alert` SSE event (burnin pass/fail).
- **SSE alert event**: `job-alert` event type on `/sse/drives`. JS listens via `htmx:sseMessage`.
- **Immediate email**: `send_job_alert()` in mailer.py. Triggered by `notifier.notify_job_complete()` from burnin.py.
- **Webhook**: `notifier._send_webhook()` — POST JSON to `WEBHOOK_URL`. Payload includes event, job_id, devname, serial, model, state, operator, error_text.
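The webhook payload might be assembled like this (a sketch — the field set is taken from the list above; the helper name is illustrative):

```python
def build_webhook_payload(event: str, job: dict, drive: dict) -> dict:
    """JSON body POSTed to WEBHOOK_URL on burnin_passed / burnin_failed."""
    return {
        "event": event,                   # "burnin_passed" or "burnin_failed"
        "job_id": job["id"],
        "devname": drive["devname"],
        "serial": drive["serial"],
        "model": drive["model"],
        "state": job["state"],
        "operator": job.get("operator"),
        "error_text": job.get("error_text"),
    }
```

Flat JSON with no nesting keeps the payload consumable by ntfy, Slack, Discord, and n8n alike.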
### Stuck Job Detection
- `burnin.check_stuck_jobs()` runs every 5 poll cycles (~1 min)
- Jobs running longer than `STUCK_JOB_HOURS` (default 24h) → state=unknown
- Logged at CRITICAL level; audit event written
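The sweep reduces to a single UPDATE (a sketch with sqlite3; the epoch-seconds `started_at` representation is an assumption — the real `check_stuck_jobs()` also logs and writes an audit event):

```python
import sqlite3
import time

def mark_stuck_jobs(db: sqlite3.Connection, stuck_hours: float = 24.0) -> int:
    """Mark burn-in jobs running longer than stuck_hours as state='unknown'."""
    cutoff = time.time() - stuck_hours * 3600
    cur = db.execute(
        "UPDATE burnin_jobs SET state='unknown' "
        "WHERE state='running' AND started_at < ?",
        (cutoff,),
    )
    db.commit()
    return cur.rowcount   # number of jobs flagged
```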
### Batch Burn-In
- Checkboxes on each idle/selectable drive row
- Batch bar appears in filter row when any drives selected
- Uses existing `POST /api/v1/burnin/start` with multiple `drive_ids`
- Requires operator name + explicit confirmation checkbox (no serial required)
- JS `checkedDriveIds` Set persists across SSE swaps via `restoreCheckboxes()`
### Drive Location
- `location` and `notes` fields added to drives table via ALTER TABLE migration
- Inline click-to-edit on location field in drive name cell
- Saves via `PATCH /api/v1/drives/{id}` on blur/Enter; restores on Escape
## Feature Reference (Stage 6c)
### Settings Page
- Two-column layout: SMTP card (left, wider) + Notifications / Behavior / Webhook stacked (right)
- Read-only system card at bottom (TrueNAS URL, poll interval, etc.) — restart required badge
- All changes save instantly via `POST /api/v1/settings` → `settings_store.save()` → `/data/settings_overrides.json`
- Overrides loaded on startup in `main.py` lifespan via `settings_store.init()`
- Connection mode dropdown auto-sets port: STARTTLS→587, SSL/TLS→465, Plain→25
- Test Connection button at top of SMTP card — tests live settings without sending email
- Brand logo in header is now a clickable `<a href="/">` home link
### SMTP Port Derivation
```python
# mailer.py — port is derived from mode, NOT from settings.smtp_port
_MODE_PORTS = {"starttls": 587, "ssl": 465, "plain": 25}
port = _MODE_PORTS.get(mode, 587)
```
Never use `settings.smtp_port` in mailer — it's kept in config for `.env` backward compat only.
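Putting the port derivation and the EHLO fix (see the bug table above) together — a sketch, not the exact `mailer.py` code:

```python
import smtplib

_MODE_PORTS = {"starttls": 587, "ssl": 465, "plain": 25}

def smtp_port_for(mode: str) -> int:
    """Port is derived from the connection mode, never from settings.smtp_port."""
    return _MODE_PORTS.get(mode, 587)

def connect_smtp(host: str, mode: str) -> smtplib.SMTP:
    """Open the connection for a given mode; EHLO after SSL and STARTTLS alike."""
    port = smtp_port_for(mode)
    if mode == "ssl":
        server = smtplib.SMTP_SSL(host, port)
        server.ehlo()                 # the missing-EHLO bug fixed in 6c
    else:
        server = smtplib.SMTP(host, port)
        server.ehlo()
        if mode == "starttls":
            server.starttls()
            server.ehlo()             # re-EHLO after the TLS upgrade
    return server
```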
### Burn-In Stage Selection
`StartBurninRequest` no longer takes `profile: str`. Instead takes:
- `run_surface: bool = True` — surface validate (destructive write test)
- `run_short: bool = True` — Short SMART (non-destructive)
- `run_long: bool = True` — Long SMART (non-destructive)
Profile string is computed as a property. Profiles: `full`, `surface_short`, `surface_long`,
`surface`, `short_long`, `short`, `long`. Precheck and final_check always run.
`STAGE_ORDER` in `burnin.py` has all 7 profile combinations.
`_recalculate_progress()` uses `_STAGE_BASE_WEIGHTS` dict (per-stage weights) and computes
overall % dynamically from actual `burnin_stages` rows — no profile lookup needed.
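The profile property and the weighted progress calculation can be sketched as below (a plain dataclass stands in for the Pydantic model, and the weight values are illustrative assumptions — only the profile-name mapping follows from the list above):

```python
from dataclasses import dataclass

@dataclass
class StartBurninSketch:
    run_surface: bool = True
    run_short: bool = True
    run_long: bool = True

    @property
    def profile(self) -> str:
        parts = [name for flag, name in
                 ((self.run_surface, "surface"),
                  (self.run_short, "short"),
                  (self.run_long, "long")) if flag]
        return "full" if len(parts) == 3 else "_".join(parts)

# Hypothetical per-stage weights; overall % is weight-averaged over actual stage rows.
_STAGE_BASE_WEIGHTS = {"precheck": 1, "short_smart": 5,
                       "long_smart": 20, "surface_validate": 20, "final_check": 1}

def overall_percent(stages: list) -> float:
    """stages: (stage_name, percent 0-100) rows from burnin_stages — no profile lookup."""
    total = sum(_STAGE_BASE_WEIGHTS.get(name, 1) for name, _ in stages)
    done = sum(_STAGE_BASE_WEIGHTS.get(name, 1) * pct / 100 for name, pct in stages)
    return round(100 * done / total, 1) if total else 0.0
```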
In the UI, both single-drive and batch modals show 3 checkboxes. If surface is unchecked:
- Destructive warning is hidden
- Serial confirmation field is hidden (single modal)
- Confirmation checkbox is hidden (batch modal)
### Table Scroll Fix
```css
.table-wrap {
  max-height: calc(100vh - 205px); /* header(44) + main-pad(20) + stats-bar(70) + filter-bar(46) + buffer */
}
```
If stats bar or other content height changes, update this offset.
## Feature Reference (Stage 6d)
### Cancel Functionality
| What | How |
|------|-----|
| Cancel running Short SMART | `✕ Short` button appears in action col when `short_busy`; calls `POST /api/v1/drives/{id}/smart/cancel` with `{type:"short"}` |
| Cancel running Long SMART | `✕ Long` button appears when `long_busy`; same route with `{type:"long"}` |
| Cancel individual burn-in | `✕ Burn-In` button (was "Cancel") shown when `bi_active`; calls `POST /api/v1/burnin/{id}/cancel` |
| Cancel All Running | Red `✕ Cancel All Burn-Ins` button appears in filter bar when any burn-in jobs are active; JS collects all `.btn-cancel[data-job-id]` and cancels each |
**SMART cancel route** (`POST /api/v1/drives/{drive_id}/smart/cancel`):
1. Fetches all running TrueNAS jobs via `client.get_smart_jobs()`
2. Finds job where `arguments[0].disks` contains the drive's devname
3. Calls `client.abort_job(tn_job_id)`
4. Updates `smart_tests` table row to `state='aborted'`
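Step 2 — matching the drive's devname against running jobs — might look like this (a sketch; the TrueNAS job-dict shape is inferred from the `smart.test` call format above):

```python
def find_smart_job(jobs: list, devname: str):
    """Return the TrueNAS job id whose smart.test arguments include devname, else None."""
    for job in jobs:
        args = job.get("arguments") or []
        if args and devname in args[0].get("disks", []):
            return job["id"]
    return None
```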
### Stage Reordering
- Default order changed to: **Short SMART → Long SMART → Surface Validate** (non-destructive first)
- Drag handles (⠿) on each stage row in both single and batch modals
- HTML5 drag-and-drop, no external library
- `getStageOrder(listId)` reads current DOM order of checked stages
- `stage_order: ["short_smart","long_smart","surface_validate"]` sent in API body
- `StartBurninRequest.stage_order: list[str] | None` — validated against allowed stage names
- `burnin.start_job()` accepts `stage_order` param; builds: `["precheck"] + stage_order + ["final_check"]`
- `_run_job()` reads stage names back from `burnin_stages ORDER BY id` — so custom order is honoured
- Destructive warning / serial confirmation still triggered by `stage-surface` checkbox ID (order-independent)
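Building and validating the final stage list might be sketched as (the validation details are assumptions — `start_job()`'s actual checks may differ):

```python
_ALLOWED_STAGES = {"short_smart", "long_smart", "surface_validate"}

def build_stage_list(stage_order):
    """Precheck and final_check always bracket the user-ordered middle stages."""
    # Default order since 6d: non-destructive stages first
    middle = stage_order or ["short_smart", "long_smart", "surface_validate"]
    unknown = set(middle) - _ALLOWED_STAGES
    if unknown:
        raise ValueError(f"unknown stages: {sorted(unknown)}")
    return ["precheck"] + middle + ["final_check"]
```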
## NPM / DNS Setup
- Proxy host: `burnin.hellocomputer.xyz` → `http://10.0.0.138:8080`
- Authelia protection: recommended (no built-in auth in app)
- DNS: `burnin.hellocomputer.xyz` CNAME → `sandon.hellocomputer.xyz` (proxied: false)