# TrueNAS Burn-In Dashboard — Project Context

> Drop this file in any new Claude session to resume work with full context.
> Last updated: 2026-02-22 (Stage 6d)

---

## What This Is

A self-hosted web dashboard for running and tracking hard-drive burn-in tests against a TrueNAS CORE instance. Deployed on **maple.local** (10.0.0.138).

- **App URL**: http://10.0.0.138:8084 (or http://burnin.hellocomputer.xyz)
- **Stack path on maple.local**: `~/docker/stacks/truenas-burnin/`
- **Source (local mac)**: `~/Desktop/claude-sandbox/truenas-burnin/`
- **Compose synced to maple.local** via `scp` or manual copy

### Stages completed

| Stage | Description | Status |
|-------|-------------|--------|
| 1 | Mock TrueNAS CORE v2.0 API (15 drives, sda–sdo) | ✅ |
| 2 | Backend core (FastAPI, SQLite/WAL, poller, TrueNAS client) | ✅ |
| 3 | Dashboard UI (Jinja2, SSE live updates, dark theme) | ✅ |
| 4 | Burn-in orchestrator (queue, concurrency, start/cancel) | ✅ |
| 5 | History page, job detail page, CSV export | ✅ |
| 6 | Hardening (retries, JSON logging, IP allowlist, poller watchdog) | ✅ |
| 6b | UX overhaul (stats bar, alerts, batch, notifications, location, print, analytics) | ✅ |
| 6c | Settings overhaul (editable form, runtime store, SMTP fix, stage selection) | ✅ |
| 6d | Cancel SMART tests, Cancel All burn-ins, drag-to-reorder stages in modals | ✅ |
| 7 | Cut to real TrueNAS | 🔲 future |

---

## File Map

```
truenas-burnin/
├── docker-compose.yml       # two services: mock-truenas + app
├── Dockerfile               # app container
├── requirements.txt
├── .env.example
├── data/                    # SQLite DB lives here (gitignored, created on deploy)
│
├── mock-truenas/
│   ├── Dockerfile
│   └── app.py               # FastAPI mock of TrueNAS CORE v2.0 REST API
│
└── app/
    ├── __init__.py
    ├── config.py            # pydantic-settings; reads .env
    ├── database.py          # schema, migrations, init_db(), get_db()
    ├── models.py            # Pydantic v2 models; StartBurninRequest has run_surface/run_short/run_long + profile property
    ├── settings_store.py    # runtime settings store — persists to /data/settings_overrides.json
    ├── truenas.py           # httpx async client with retry (lambda factory pattern)
    ├── poller.py            # poll loop, SSE pub/sub, stale detection, stuck-job check
    ├── burnin.py            # orchestrator, semaphore, stages, check_stuck_jobs()
    ├── notifier.py          # webhook + immediate email alerts on job completion
    ├── mailer.py            # daily HTML email + per-job alert email
    ├── logging_config.py    # structured JSON logging
    ├── renderer.py          # Jinja2 + filters (format_bytes, format_eta, format_elapsed, …)
    ├── routes.py            # all FastAPI route handlers
    ├── main.py              # app factory, IP allowlist middleware, lifespan
    │
    ├── static/
    │   ├── app.css          # full dark theme + mobile responsive
    │   └── app.js           # push notifications, batch, elapsed timers, inline edit
    │
    └── templates/
        ├── layout.html      # header nav: History, Stats, Audit, Settings, bell button
        ├── dashboard.html   # stats bar, failed banner, batch bar
        ├── history.html
        ├── job_detail.html  # + Print/Export button
        ├── audit.html       # audit event log
        ├── stats.html       # analytics: pass rate by model, daily activity
        ├── settings.html    # editable 2-col form: SMTP (left) + Notifications/Behavior/Webhook (right)
        ├── job_print.html   # print view with client-side QR code (qrcodejs CDN)
        └── components/
            ├── drives_table.html  # checkboxes, elapsed time, location inline edit
            ├── modal_start.html   # single-drive burn-in modal
            └── modal_batch.html   # batch burn-in modal
```

---

## Architecture Overview

```
Browser ──HTMX SSE──▶ GET /sse/drives
                        │ poller.subscribe()
                        │ asyncio.Queue ◀─── poller.run() notifies after each poll
                        │                    & after each burnin stage update
                        render drives_table.html
                        yield SSE "drives-update" event
```

- **Poller** (`poller.py`): runs every `POLL_INTERVAL_SECONDS` (default 12s), calls TrueNAS `/api/v2.0/disk` and `/api/v2.0/core/get_jobs`, writes to SQLite, notifies SSE subscribers
- **Burn-in** (`burnin.py`): `asyncio.Semaphore(max_parallel_burnins)` gates concurrency.
Jobs are created immediately (queued state); the semaphore gates actual execution. On startup, any interrupted running jobs → state=unknown; queued jobs are re-enqueued.
- **SSE** (`routes.py /sse/drives`): one persistent connection per browser tab. Renders a fresh `drives_table.html` HTML fragment on every notification.
- **HTMX** (`dashboard.html`): `hx-ext="sse"` + `sse-swap="drives-update"` replaces `#drives-tbody` content without a page reload.

---

## Database Schema (SQLite WAL mode)

```sql
-- drives: upsert by truenas_disk_id (the TrueNAS internal disk identifier)
drives (id, truenas_disk_id UNIQUE, devname, serial, model, size_bytes,
        temperature_c, smart_health, last_polled_at)

-- smart_tests: one row per drive+test_type combination (UNIQUE constraint)
smart_tests (id, drive_id FK, test_type CHECK('short','long'), state, percent,
             started_at, eta_at, finished_at, error_text, UNIQUE(drive_id, test_type))

-- burnin_jobs: one row per burn-in run (multiple per drive over time)
burnin_jobs (id, drive_id FK, profile, state CHECK(queued/running/passed/failed/cancelled/unknown),
             percent, stage_name, operator, created_at, started_at, finished_at, error_text)

-- burnin_stages: one row per stage per job
burnin_stages (id, burnin_job_id FK, stage_name, state, percent, started_at, finished_at, error_text)

-- audit_events: append-only log
audit_events (id, event_type, drive_id, job_id, operator, note, created_at)
```

---

## Burn-In Stage Definitions

```python
STAGE_ORDER = {
    "quick": ["precheck", "short_smart", "io_validate", "final_check"],
    "full": ["precheck", "surface_validate", "short_smart", "long_smart", "final_check"],
}
```

The UI only exposes the **full** profile (destructive). The quick profile exists for dev/testing.
---

## TrueNAS API Contracts Used

| Method | Endpoint | Notes |
|--------|----------|-------|
| GET | `/api/v2.0/disk` | List all disks |
| POST | `/api/v2.0/smart/test` | Start SMART test `{disks:[name], type:"SHORT"\|"LONG"}` |
| GET | `/api/v2.0/core/get_jobs` | Filter `[["method","=","smart.test"]]` |
| POST | `/api/v2.0/core/job_abort` | `job_id` positional arg |
| GET | `/api/v2.0/smart/test/results/{disk}` | Per-disk SMART results |

Auth: `Authorization: Bearer {TRUENAS_API_KEY}` header.

---

## Config / Environment Variables

All read from `.env` via `pydantic-settings`. See `.env.example` for the full list.

| Variable | Default | Notes |
|----------|---------|-------|
| `APP_HOST` | `0.0.0.0` | |
| `APP_PORT` | `8080` | |
| `DB_PATH` | `/data/app.db` | Inside container |
| `TRUENAS_BASE_URL` | `http://localhost:8000` | Point at mock or real TrueNAS |
| `TRUENAS_API_KEY` | `mock-key` | Real API key for prod |
| `TRUENAS_VERIFY_TLS` | `false` | Set true for prod with valid cert |
| `POLL_INTERVAL_SECONDS` | `12` | |
| `STALE_THRESHOLD_SECONDS` | `45` | UI shows warning if data older than this |
| `MAX_PARALLEL_BURNINS` | `2` | asyncio.Semaphore limit |
| `SURFACE_VALIDATE_SECONDS` | `45` | Mock only — duration of surface stage |
| `IO_VALIDATE_SECONDS` | `25` | Mock only — duration of I/O stage |
| `STUCK_JOB_HOURS` | `24` | Hours before a running job is auto-marked unknown |
| `LOG_LEVEL` | `INFO` | |
| `ALLOWED_IPS` | `` | Empty = allow all. Comma-sep IPs/CIDRs |
| `SMTP_HOST` | `` | Empty = email disabled |
| `SMTP_PORT` | `587` | |
| `SMTP_USER` | `` | |
| `SMTP_PASSWORD` | `` | |
| `SMTP_FROM` | `` | |
| `SMTP_TO` | `` | Comma-separated |
| `SMTP_REPORT_HOUR` | `8` | Local hour (0-23) to send daily report |
| `SMTP_ALERT_ON_FAIL` | `true` | Immediate email when a job fails |
| `SMTP_ALERT_ON_PASS` | `false` | Immediate email when a job passes |
| `WEBHOOK_URL` | `` | POST JSON on burnin_passed/burnin_failed. Works with ntfy, Slack, Discord, n8n |

---

## Deploy Workflow

### First deploy (already done)

```bash
# On maple.local
cd ~/docker/stacks/truenas-burnin
docker compose up -d --build
```

### Redeploy after code changes

```bash
# Copy changed files from mac to maple.local first, e.g.:
scp -P 2225 -r app/ brandon@10.0.0.138:~/docker/stacks/truenas-burnin/

# Then on maple.local:
ssh brandon@10.0.0.138 -p 2225
cd ~/docker/stacks/truenas-burnin
docker compose up -d --build
```

### Reset the database (e.g. after schema changes)

```bash
# On maple.local — stop containers first
docker compose stop app

# Delete DB using alpine (container owns the file, sudo not available)
docker run --rm -v ~/docker/stacks/truenas-burnin/data:/data alpine rm -f /data/app.db

docker compose start app
```

### Check logs

```bash
docker compose logs -f app
docker compose logs -f mock-truenas
```

---

## Mock TrueNAS Server (`mock-truenas/app.py`)

- 15 drives: `sda`–`sdo`
- Drive mix: 3× ST12000NM0008 12TB, 3× WD80EFAX 8TB, 2× ST16000NM001G 16TB, 2× ST4000VN008 4TB, 2× TOSHIBA MG06ACA10TE 10TB, 1× HGST HUS728T8TAL5200 8TB, 1× Seagate Barracuda ST6000DM003 6TB, 1× **FAIL001** (sdn) — always fails at ~30%
- SHORT test: 90s simulated; LONG test: 480s simulated; tick every 5s
- Debug endpoints:
  - `POST /debug/reset` — reset all jobs/state
  - `GET /debug/state` — dump current state
  - `POST /debug/complete-all-jobs` — instantly complete all running tests

---

## Key Implementation Patterns

### Retry pattern — lambda factory (NOT coroutine object)

```python
# CORRECT: pass a factory so each retry creates a fresh coroutine
r = await _with_retry(lambda: self._client.get("/api/v2.0/disk"), "get_disks")

# WRONG: coroutine is exhausted after first await, retry silently fails
r = await _with_retry(self._client.get("/api/v2.0/disk"), "get_disks")
```

### SSE template rendering

```python
# Use templates.env.get_template().render() — not TemplateResponse (that's a Response object)
html = templates.env.get_template("components/drives_table.html").render(drives=drives)
yield {"event": "drives-update", "data": html}
```

### Sticky thead scroll fix

```css
/* BOTH axes required on table-wrap for position:sticky to work on thead */
.table-wrap {
  overflow: auto;  /* NOT overflow-x: auto */
  max-height: calc(100vh - 130px);
}
thead { position: sticky; top: 0; z-index: 10; }
```

### export.csv route ordering

```python
# MUST register export.csv BEFORE /{job_id} — FastAPI tries int() on "export.csv"
@router.get("/api/v1/burnin/export.csv")  # first
async def burnin_export_csv(...): ...

@router.get("/api/v1/burnin/{job_id}")  # second
async def burnin_get(job_id: int, ...): ...
```

---

## Known Issues / Past Bugs Fixed

| Bug | Root Cause | Fix |
|-----|-----------|-----|
| `_execute_stages` used `STAGE_ORDER[profile]`, ignoring custom order | Stage order stored in DB but not read back | `_run_job` reads stages from `burnin_stages ORDER BY id`; `_execute_stages` accepts `stages: list[str]` |
| Poller stuck at 'running' after completion | `_sync_history()` had early-return guard when state=running | Removed guard — `_sync_history` only called when job not in active dict |
| DB schema tables missing after edit | Tables split into separate variable never passed to `executescript()` | Put all tables in single `SCHEMA` string |
| Retry not retrying | `_with_retry(coro)` — coroutine exhausted after first fail | Changed to `_with_retry(factory: Callable[[], Coroutine])` |
| `error_text` overwritten | `_finish_stage(success=False)` overwrote error set by stage handler | `_finish_stage` omits `error_text` column in SQL when param is None |
| Cancelled stage showed 'failed' | `_execute_stages` called `_finish_stage(success=False)` on cancel | Check `_is_cancelled()`, call `_cancel_stage()` instead |
| export.csv returns 422 | Route registered after `/{job_id}`, FastAPI tries `int("export.csv")` | Move export route before parameterized route |
| Old drive names persist after mock rename | Poller upserts by `truenas_disk_id`, old rows stay | Delete `app.db` and restart |
| First row clipped behind sticky thead | `overflow-x: auto` only creates partial stacking context | Use `overflow: auto` (both axes) on `.table-wrap` |
| `rm data/app.db` permission denied | Container owns the file | Use `docker run --rm -v .../data:/data alpine rm -f /data/app.db` |
| First row clipped after Stage 6b | Stats bar added 70px but max-height not updated | `max-height: calc(100vh - 205px)` |
| SMTP "Connection unexpectedly closed" | `_send_email` used `settings.smtp_port` (587 default) even in SSL mode | Derive port from mode via `_MODE_PORTS` dict; SSL→465, STARTTLS→587, Plain→25 |
| SSL mode missing EHLO | `smtplib.SMTP_SSL` was created without calling `ehlo()` | Added `server.ehlo()` after both SSL and STARTTLS connections |

---

## Stage 7 — Cutting to Real TrueNAS (TODO)

When ready to test against a real TrueNAS CORE box:

1. In `.env` on maple.local, set:
   ```env
   TRUENAS_BASE_URL=https://10.0.0.203   # or whatever your TrueNAS IP is
   TRUENAS_API_KEY=your-real-key-here
   TRUENAS_VERIFY_TLS=false              # unless you have a valid cert
   ```
2. Comment out the `mock-truenas` service in `docker-compose.yml` (or leave it running — harmless)
3. Verify the TrueNAS CORE v2.0 API contract matches what `truenas.py` expects:
   - `GET /api/v2.0/disk` returns a list with `name`, `serial`, `model`, `size`, `temperature`
   - `GET /api/v2.0/core/get_jobs` with filter `[["method","=","smart.test"]]`
   - `POST /api/v2.0/smart/test` accepts `{disks: [devname], type: "SHORT"|"LONG"}`
4. Check that disk names match the expected format (TrueNAS CORE uses `ada0`, `da0`, etc. — not `sda`)
   - You may need to update mock drive names back or adjust poller logic
5. Delete `app.db` to clear mock drive rows before first real poll

---

## Feature Reference (Stage 6b)

### New Pages

| URL | Description |
|-----|-------------|
| `/stats` | Analytics — pass rate by model, daily activity last 14 days |
| `/audit` | Audit log — last 200 events with drive/operator context |
| `/settings` | Editable 2-col settings form (SMTP, Notifications, Behavior, Webhook) |
| `/history/{id}/print` | Print-friendly job report with QR code |

### New API Routes (6b + 6c)

| Method | Path | Description |
|--------|------|-------------|
| `PATCH` | `/api/v1/drives/{id}` | Update `notes` and/or `location` |
| `POST` | `/api/v1/settings` | Save runtime settings to `/data/settings_overrides.json` |
| `POST` | `/api/v1/settings/test-smtp` | Test SMTP connection without sending email |

### Notifications

- **Browser push**: Bell icon in header → `Notification.requestPermission()`. Fires on `job-alert` SSE event (burnin pass/fail).
- **SSE alert event**: `job-alert` event type on `/sse/drives`. JS listens via `htmx:sseMessage`.
- **Immediate email**: `send_job_alert()` in mailer.py. Triggered by `notifier.notify_job_complete()` from burnin.py.
- **Webhook**: `notifier._send_webhook()` — POST JSON to `WEBHOOK_URL`. Payload includes event, job_id, devname, serial, model, state, operator, error_text.
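A hedged sketch of the webhook payload shape listed above. The field names come from this document; the helper name and the example values are invented for illustration, and the real implementation lives in `notifier._send_webhook()`.

```python
import json

def build_webhook_payload(event: str, job: dict) -> str:
    # Mirrors the documented payload fields: event, job_id, devname,
    # serial, model, state, operator, error_text. Shape is a sketch,
    # not copied from notifier.py.
    return json.dumps({
        "event": event,  # "burnin_passed" | "burnin_failed"
        "job_id": job["id"],
        "devname": job["devname"],
        "serial": job["serial"],
        "model": job["model"],
        "state": job["state"],
        "operator": job["operator"],
        "error_text": job.get("error_text"),
    })

# Example values are invented; sdn/FAIL001 is the mock's always-fail drive.
body = build_webhook_payload("burnin_failed", {
    "id": 42, "devname": "sdn", "serial": "Z000FAIL", "model": "FAIL001",
    "state": "failed", "operator": "brandon",
    "error_text": "surface_validate failed at ~30%",
})
print(json.loads(body)["event"])
```

Flat JSON like this is what makes the webhook work unmodified with ntfy, Slack, Discord, and n8n style receivers.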
### Stuck Job Detection

- `burnin.check_stuck_jobs()` runs every 5 poll cycles (~1 min)
- Jobs running longer than `STUCK_JOB_HOURS` (default 24h) → state=unknown
- Logged at CRITICAL level; audit event written

### Batch Burn-In

- Checkboxes on each idle/selectable drive row
- Batch bar appears in the filter row when any drives are selected
- Uses the existing `POST /api/v1/burnin/start` with multiple `drive_ids`
- Requires operator name + explicit confirmation checkbox (no serial required)
- JS `checkedDriveIds` Set persists across SSE swaps via `restoreCheckboxes()`

### Drive Location

- `location` and `notes` fields added to drives table via ALTER TABLE migration
- Inline click-to-edit on the location field in the drive name cell
- Saves via `PATCH /api/v1/drives/{id}` on blur/Enter; restores on Escape

## Feature Reference (Stage 6c)

### Settings Page

- Two-column layout: SMTP card (left, wider) + Notifications / Behavior / Webhook stacked (right)
- Read-only system card at bottom (TrueNAS URL, poll interval, etc.) — restart-required badge
- All changes save instantly via `POST /api/v1/settings` → `settings_store.save()` → `/data/settings_overrides.json`
- Overrides loaded on startup in `main.py` lifespan via `settings_store.init()`
- Connection mode dropdown auto-sets port: STARTTLS→587, SSL/TLS→465, Plain→25
- Test Connection button at top of SMTP card — tests live settings without sending email
- Brand logo in header is now a clickable home link

### SMTP Port Derivation

```python
# mailer.py — port is derived from mode, NOT from settings.smtp_port
_MODE_PORTS = {"starttls": 587, "ssl": 465, "plain": 25}
port = _MODE_PORTS.get(mode, 587)
```

Never use `settings.smtp_port` in mailer — it is kept in config only for `.env` backward compatibility.

### Burn-In Stage Selection

`StartBurninRequest` no longer takes `profile: str`.
Instead it takes:

- `run_surface: bool = True` — surface validate (destructive write test)
- `run_short: bool = True` — Short SMART (non-destructive)
- `run_long: bool = True` — Long SMART (non-destructive)

The profile string is computed as a property. Profiles: `full`, `surface_short`, `surface_long`, `surface`, `short_long`, `short`, `long`. Precheck and final_check always run. `STAGE_ORDER` in `burnin.py` has all 7 profile combinations.

`_recalculate_progress()` uses a `_STAGE_BASE_WEIGHTS` dict (per-stage weights) and computes the overall % dynamically from actual `burnin_stages` rows — no profile lookup needed.

In the UI, both single-drive and batch modals show 3 checkboxes. If surface is unchecked:

- Destructive warning is hidden
- Serial confirmation field is hidden (single modal)
- Confirmation checkbox is hidden (batch modal)

### Table Scroll Fix

```css
.table-wrap {
  /* header(44) + main-pad(20) + stats-bar(70) + filter-bar(46) + buffer */
  max-height: calc(100vh - 205px);
}
```

If the stats bar or other content height changes, update this offset.

## Feature Reference (Stage 6d)

### Cancel Functionality

| What | How |
|------|-----|
| Cancel running Short SMART | `✕ Short` button appears in action col when `short_busy`; calls `POST /api/v1/drives/{id}/smart/cancel` with `{type:"short"}` |
| Cancel running Long SMART | `✕ Long` button appears when `long_busy`; same route with `{type:"long"}` |
| Cancel individual burn-in | `✕ Burn-In` button (was "Cancel") shown when `bi_active`; calls `POST /api/v1/burnin/{id}/cancel` |
| Cancel All Running | Red `✕ Cancel All Burn-Ins` button appears in filter bar when any burn-in jobs are active; JS collects all `.btn-cancel[data-job-id]` and cancels each |

**SMART cancel route** (`POST /api/v1/drives/{drive_id}/smart/cancel`):

1. Fetches all running TrueNAS jobs via `client.get_smart_jobs()`
2. Finds the job where `arguments[0].disks` contains the drive's devname
3. Calls `client.abort_job(tn_job_id)`
4. Updates the `smart_tests` table row to `state='aborted'`

### Stage Reordering

- Default order changed to: **Short SMART → Long SMART → Surface Validate** (non-destructive first)
- Drag handles (⠿) on each stage row in both single and batch modals
- HTML5 drag-and-drop, no external library
- `getStageOrder(listId)` reads the current DOM order of checked stages
- `stage_order: ["short_smart","long_smart","surface_validate"]` sent in API body
- `StartBurninRequest.stage_order: list[str] | None` — validated against allowed stage names
- `burnin.start_job()` accepts a `stage_order` param; builds `["precheck"] + stage_order + ["final_check"]`
- `_run_job()` reads stage names back from `burnin_stages ORDER BY id` — so custom order is honoured
- Destructive warning / serial confirmation still triggered by the `stage-surface` checkbox ID (order-independent)

## NPM / DNS Setup

- Proxy host: `burnin.hellocomputer.xyz` → `http://10.0.0.138:8080`
- Authelia protection: recommended (no built-in auth in app)
- DNS: `burnin.hellocomputer.xyz` CNAME → `sandon.hellocomputer.xyz` (proxied: false)