Full-stack burn-in orchestration dashboard (Stages 1–6d complete): FastAPI backend, SQLite/WAL, SSE live dashboard, mock TrueNAS server, SMTP/webhook notifications, batch burn-in, settings UI, audit log, stats page, cancel SMART/burn-in, drag-to-reorder stages. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
TrueNAS Burn-In Dashboard — Project Context
Drop this file in any new Claude session to resume work with full context. Last updated: 2026-02-22 (Stage 6d)
What This Is
A self-hosted web dashboard for running and tracking hard-drive burn-in tests against a TrueNAS CORE instance. Deployed on maple.local (10.0.0.138).
- App URL: http://10.0.0.138:8084 (or http://burnin.hellocomputer.xyz)
- Stack path on maple.local: `~/docker/stacks/truenas-burnin/`
- Source (local mac): `~/Desktop/claude-sandbox/truenas-burnin/`
- Code synced to maple.local via `scp` or manual copy
Stages completed
| Stage | Description | Status |
|---|---|---|
| 1 | Mock TrueNAS CORE v2.0 API (15 drives, sda–sdo) | ✅ |
| 2 | Backend core (FastAPI, SQLite/WAL, poller, TrueNAS client) | ✅ |
| 3 | Dashboard UI (Jinja2, SSE live updates, dark theme) | ✅ |
| 4 | Burn-in orchestrator (queue, concurrency, start/cancel) | ✅ |
| 5 | History page, job detail page, CSV export | ✅ |
| 6 | Hardening (retries, JSON logging, IP allowlist, poller watchdog) | ✅ |
| 6b | UX overhaul (stats bar, alerts, batch, notifications, location, print, analytics) | ✅ |
| 6c | Settings overhaul (editable form, runtime store, SMTP fix, stage selection) | ✅ |
| 6d | Cancel SMART tests, Cancel All burn-ins, drag-to-reorder stages in modals | ✅ |
| 7 | Cut to real TrueNAS | 🔲 future |
File Map
truenas-burnin/
├── docker-compose.yml # two services: mock-truenas + app
├── Dockerfile # app container
├── requirements.txt
├── .env.example
├── data/ # SQLite DB lives here (gitignored, created on deploy)
│
├── mock-truenas/
│ ├── Dockerfile
│ └── app.py # FastAPI mock of TrueNAS CORE v2.0 REST API
│
└── app/
├── __init__.py
├── config.py # pydantic-settings; reads .env
├── database.py # schema, migrations, init_db(), get_db()
├── models.py # Pydantic v2 models; StartBurninRequest has run_surface/run_short/run_long + profile property
├── settings_store.py # runtime settings store — persists to /data/settings_overrides.json
├── truenas.py # httpx async client with retry (lambda factory pattern)
├── poller.py # poll loop, SSE pub/sub, stale detection, stuck-job check
├── burnin.py # orchestrator, semaphore, stages, check_stuck_jobs()
├── notifier.py # webhook + immediate email alerts on job completion
├── mailer.py # daily HTML email + per-job alert email
├── logging_config.py # structured JSON logging
├── renderer.py # Jinja2 + filters (format_bytes, format_eta, format_elapsed, …)
├── routes.py # all FastAPI route handlers
├── main.py # app factory, IP allowlist middleware, lifespan
│
├── static/
│ ├── app.css # full dark theme + mobile responsive
│ └── app.js # push notifications, batch, elapsed timers, inline edit
│
└── templates/
├── layout.html # header nav: History, Stats, Audit, Settings, bell button
├── dashboard.html # stats bar, failed banner, batch bar
├── history.html
├── job_detail.html # + Print/Export button
├── audit.html # audit event log
├── stats.html # analytics: pass rate by model, daily activity
├── settings.html # editable 2-col form: SMTP (left) + Notifications/Behavior/Webhook (right)
├── job_print.html # print view with client-side QR code (qrcodejs CDN)
└── components/
├── drives_table.html # checkboxes, elapsed time, location inline edit
├── modal_start.html # single-drive burn-in modal
└── modal_batch.html # batch burn-in modal
Architecture Overview
```
Browser ──HTMX SSE──▶ GET /sse/drives
                           │
                    poller.subscribe()
                           │
       asyncio.Queue ◀─── poller.run() notifies after each poll
                           │         & after each burnin stage update
                    render drives_table.html
                    yield SSE "drives-update" event
```
- Poller (`poller.py`): runs every `POLL_INTERVAL_SECONDS` (default 12 s), calls TrueNAS `/api/v2.0/disk` and `/api/v2.0/core/get_jobs`, writes to SQLite, notifies SSE subscribers
- Burn-in (`burnin.py`): `asyncio.Semaphore(max_parallel_burnins)` gates concurrency. Jobs are created immediately (queued state); the semaphore gates actual execution. On startup, any interrupted running jobs → state=unknown; queued jobs are re-enqueued.
- SSE (`routes.py` `/sse/drives`): one persistent connection per browser tab. Renders a fresh `drives_table.html` HTML fragment on every notification.
- HTMX (`dashboard.html`): `hx-ext="sse"` + `sse-swap="drives-update"` replaces `#drives-tbody` content without a page reload.
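The pub/sub core of this loop can be sketched as follows (a minimal standalone sketch; `Broadcaster` is a hypothetical name, the real poller.py may structure this differently):

```python
import asyncio

class Broadcaster:
    """Sketch of the poller's SSE pub/sub over asyncio.Queue."""

    def __init__(self) -> None:
        self._subscribers: set[asyncio.Queue] = set()

    def subscribe(self) -> asyncio.Queue:
        # One queue per SSE connection (i.e. per browser tab).
        q: asyncio.Queue = asyncio.Queue()
        self._subscribers.add(q)
        return q

    def unsubscribe(self, q: asyncio.Queue) -> None:
        self._subscribers.discard(q)

    def notify(self) -> None:
        # Called after each poll and after each burn-in stage update.
        for q in self._subscribers:
            q.put_nowait("drives-update")
```

Each SSE handler awaits its own queue and re-renders the fragment on every wake-up, so slow tabs never block the poller.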
Database Schema (SQLite WAL mode)
```sql
-- drives: upsert by truenas_disk_id (the TrueNAS internal disk identifier)
drives (id, truenas_disk_id UNIQUE, devname, serial, model, size_bytes,
        temperature_c, smart_health, last_polled_at)

-- smart_tests: one row per drive+test_type combination (UNIQUE constraint)
smart_tests (id, drive_id FK, test_type CHECK('short','long'),
             state, percent, started_at, eta_at, finished_at, error_text,
             UNIQUE(drive_id, test_type))

-- burnin_jobs: one row per burn-in run (multiple per drive over time)
burnin_jobs (id, drive_id FK, profile, state CHECK(queued/running/passed/
             failed/cancelled/unknown), percent, stage_name, operator,
             created_at, started_at, finished_at, error_text)

-- burnin_stages: one row per stage per job
burnin_stages (id, burnin_job_id FK, stage_name, state, percent,
               started_at, finished_at, error_text)

-- audit_events: append-only log
audit_events (id, event_type, drive_id, job_id, operator, note, created_at)
```
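A minimal sketch of the WAL setup and the single-SCHEMA-string pattern (abridged: only two tables shown; the real database.py also creates smart_tests, burnin_jobs, and burnin_stages, plus migrations):

```python
import sqlite3

# Abridged schema; the real SCHEMA string holds all five tables.
SCHEMA = """
CREATE TABLE IF NOT EXISTS drives (
    id INTEGER PRIMARY KEY,
    truenas_disk_id TEXT UNIQUE,
    devname TEXT, serial TEXT, model TEXT, size_bytes INTEGER,
    temperature_c INTEGER, smart_health TEXT, last_polled_at TEXT
);
CREATE TABLE IF NOT EXISTS audit_events (
    id INTEGER PRIMARY KEY, event_type TEXT, drive_id INTEGER,
    job_id INTEGER, operator TEXT, note TEXT, created_at TEXT
);
"""

def init_db(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    # Keep ALL tables in one string passed to executescript();
    # tables split into a separate variable that never reaches
    # executescript() was a past bug (see Known Issues).
    conn.executescript(SCHEMA)
    return conn
```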
Burn-In Stage Definitions
```python
STAGE_ORDER = {
    "quick": ["precheck", "short_smart", "io_validate", "final_check"],
    "full":  ["precheck", "surface_validate", "short_smart", "long_smart", "final_check"],
}
```
The UI only exposes the full profile (destructive); quick exists for dev/testing. As of Stage 6c, STAGE_ORDER also contains the seven stage-selection profile combinations (see Feature Reference, Stage 6c).
TrueNAS API Contracts Used
| Method | Endpoint | Notes |
|---|---|---|
| GET | `/api/v2.0/disk` | List all disks |
| POST | `/api/v2.0/smart/test` | Start SMART test: `{disks:[name], type:"SHORT"/"LONG"}` |
| GET | `/api/v2.0/core/get_jobs` | Filter `[["method","=","smart.test"]]` |
| POST | `/api/v2.0/core/job_abort` | `job_id` positional arg |
| GET | `/api/v2.0/smart/test/results/{disk}` | Per-disk SMART results |

Auth: `Authorization: Bearer {TRUENAS_API_KEY}` header.
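The request shapes above, sketched as plain helpers (hypothetical function names for illustration, not the real truenas.py API):

```python
def smart_test_payload(devname: str, long_test: bool = False) -> dict:
    """Body for POST /api/v2.0/smart/test."""
    return {"disks": [devname], "type": "LONG" if long_test else "SHORT"}

def smart_jobs_filter() -> list:
    """Query filter for GET /api/v2.0/core/get_jobs."""
    return [["method", "=", "smart.test"]]

def auth_headers(api_key: str) -> dict:
    """Every request carries the bearer token."""
    return {"Authorization": f"Bearer {api_key}"}
```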
Config / Environment Variables
All read from .env via pydantic-settings. See .env.example for full list.
| Variable | Default | Notes |
|---|---|---|
| `APP_HOST` | `0.0.0.0` | |
| `APP_PORT` | `8080` | |
| `DB_PATH` | `/data/app.db` | Inside container |
| `TRUENAS_BASE_URL` | `http://localhost:8000` | Point at mock or real TrueNAS |
| `TRUENAS_API_KEY` | `mock-key` | Real API key for prod |
| `TRUENAS_VERIFY_TLS` | `false` | Set `true` for prod with valid cert |
| `POLL_INTERVAL_SECONDS` | `12` | |
| `STALE_THRESHOLD_SECONDS` | `45` | UI shows warning if data older than this |
| `MAX_PARALLEL_BURNINS` | `2` | `asyncio.Semaphore` limit |
| `SURFACE_VALIDATE_SECONDS` | `45` | Mock only — duration of surface stage |
| `IO_VALIDATE_SECONDS` | `25` | Mock only — duration of I/O stage |
| `STUCK_JOB_HOURS` | `24` | Hours before a running job is auto-marked unknown |
| `LOG_LEVEL` | `INFO` | |
| `ALLOWED_IPS` | (empty) | Empty = allow all. Comma-sep IPs/CIDRs |
| `SMTP_HOST` | (empty) | Empty = email disabled |
| `SMTP_PORT` | `587` | |
| `SMTP_USER` | (empty) | |
| `SMTP_PASSWORD` | (empty) | |
| `SMTP_FROM` | (empty) | |
| `SMTP_TO` | (empty) | Comma-separated |
| `SMTP_REPORT_HOUR` | `8` | Local hour (0–23) to send daily report |
| `SMTP_ALERT_ON_FAIL` | `true` | Immediate email when a job fails |
| `SMTP_ALERT_ON_PASS` | `false` | Immediate email when a job passes |
| `WEBHOOK_URL` | (empty) | POST JSON on burnin_passed/burnin_failed. Works with ntfy, Slack, Discord, n8n |
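The `ALLOWED_IPS` semantics can be sketched with the stdlib `ipaddress` module (illustrative only; the real allowlist middleware in main.py may differ):

```python
import ipaddress

def ip_allowed(client_ip: str, allowed: str) -> bool:
    """Empty ALLOWED_IPS string = allow all; otherwise a
    comma-separated list of single IPs and/or CIDR ranges."""
    if not allowed.strip():
        return True
    addr = ipaddress.ip_address(client_ip)
    for entry in allowed.split(","):
        entry = entry.strip()
        if not entry:
            continue
        # strict=False lets "10.0.0.138/24" parse as 10.0.0.0/24;
        # a bare IP parses as a /32 (or /128) network.
        if addr in ipaddress.ip_network(entry, strict=False):
            return True
    return False
```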
Deploy Workflow
First deploy (already done)
```sh
# On maple.local
cd ~/docker/stacks/truenas-burnin
docker compose up -d --build
```
Redeploy after code changes
```sh
# Copy changed files from mac to maple.local first, e.g.:
scp -P 2225 -r app/ brandon@10.0.0.138:~/docker/stacks/truenas-burnin/

# Then on maple.local:
ssh brandon@10.0.0.138 -p 2225
cd ~/docker/stacks/truenas-burnin
docker compose up -d --build
```
Reset the database (e.g. after schema changes)
```sh
# On maple.local — stop containers first
docker compose stop app

# Delete DB using alpine (container owns the file, sudo not available)
docker run --rm -v ~/docker/stacks/truenas-burnin/data:/data alpine rm -f /data/app.db

docker compose start app
```
Check logs
```sh
docker compose logs -f app
docker compose logs -f mock-truenas
```
Mock TrueNAS Server (mock-truenas/app.py)
- 15 drives: `sda`–`sdo`
- Drive mix: 3× ST12000NM0008 12TB, 3× WD80EFAX 8TB, 2× ST16000NM001G 16TB, 2× ST4000VN008 4TB, 2× TOSHIBA MG06ACA10TE 10TB, 1× HGST HUS728T8TAL5200 8TB, 1× Seagate Barracuda ST6000DM003 6TB, 1× FAIL001 (`sdn`) — always fails at ~30%
- SHORT test: 90 s simulated; LONG test: 480 s simulated; tick every 5 s
- Debug endpoints:
  - `POST /debug/reset` — reset all jobs/state
  - `GET /debug/state` — dump current state
  - `POST /debug/complete-all-jobs` — instantly complete all running tests
Key Implementation Patterns
Retry pattern — lambda factory (NOT coroutine object)
```python
# CORRECT: pass a factory so each retry creates a fresh coroutine
r = await _with_retry(lambda: self._client.get("/api/v2.0/disk"), "get_disks")

# WRONG: coroutine is exhausted after first await, retry silently fails
r = await _with_retry(self._client.get("/api/v2.0/disk"), "get_disks")
```
SSE template rendering
```python
# Use templates.env.get_template().render() — not TemplateResponse (that's a Response object)
html = templates.env.get_template("components/drives_table.html").render(drives=drives)
yield {"event": "drives-update", "data": html}
```
Sticky thead scroll fix
```css
/* BOTH axes required on table-wrap for position:sticky to work on thead */
.table-wrap {
  overflow: auto; /* NOT overflow-x: auto */
  max-height: calc(100vh - 130px);
}
thead { position: sticky; top: 0; z-index: 10; }
```
export.csv route ordering
```python
# MUST register export.csv BEFORE /{job_id} — FastAPI tries int() on "export.csv"
@router.get("/api/v1/burnin/export.csv")  # first
async def burnin_export_csv(...): ...

@router.get("/api/v1/burnin/{job_id}")    # second
async def burnin_get(job_id: int, ...): ...
```
Known Issues / Past Bugs Fixed
| Bug | Root Cause | Fix |
|---|---|---|
| `_execute_stages` used `STAGE_ORDER[profile]`, ignoring custom order | Stage order stored in DB but not read back | `_run_job` reads stages from `burnin_stages ORDER BY id`; `_execute_stages` accepts `stages: list[str]` |
| Poller stuck at 'running' after completion | `_sync_history()` had early-return guard when state=running | Removed guard — `_sync_history` only called when job not in active dict |
| DB schema tables missing after edit | Tables split into a separate variable never passed to `executescript()` | Put all tables in single SCHEMA string |
| Retry not retrying | `_with_retry(coro)` — coroutine exhausted after first fail | Changed to `_with_retry(factory: Callable[[], Coroutine])` |
| `error_text` overwritten | `_finish_stage(success=False)` overwrote error set by stage handler | `_finish_stage` omits error_text column in SQL when param is None |
| Cancelled stage showed 'failed' | `_execute_stages` called `_finish_stage(success=False)` on cancel | Check `_is_cancelled()`, call `_cancel_stage()` instead |
| export.csv returns 422 | Route registered after `/{job_id}`, FastAPI tries `int("export.csv")` | Move export route before parameterized route |
| Old drive names persist after mock rename | Poller upserts by truenas_disk_id, old rows stay | Delete app.db and restart |
| First row clipped behind sticky thead | `overflow-x: auto` only creates partial stacking context | Use `overflow: auto` (both axes) on `.table-wrap` |
| `rm data/app.db` permission denied | Container owns the file | Use `docker run --rm -v .../data:/data alpine rm -f /data/app.db` |
| First row clipped after Stage 6b | Stats bar added 70px but max-height not updated | `max-height: calc(100vh - 205px)` |
| SMTP "Connection unexpectedly closed" | `_send_email` used `settings.smtp_port` (587 default) even in SSL mode | Derive port from mode via `_MODE_PORTS` dict; SSL→465, STARTTLS→587, Plain→25 |
| SSL mode missing EHLO | `smtplib.SMTP_SSL` was created without calling `ehlo()` | Added `server.ehlo()` after both SSL and STARTTLS connections |
Stage 7 — Cutting to Real TrueNAS (TODO)
When ready to test against a real TrueNAS CORE box:
- In `.env` on maple.local, set:
  ```sh
  TRUENAS_BASE_URL=https://10.0.0.203   # or whatever your TrueNAS IP is
  TRUENAS_API_KEY=your-real-key-here
  TRUENAS_VERIFY_TLS=false              # unless you have a valid cert
  ```
- Comment out the `mock-truenas` service in `docker-compose.yml` (or leave it running — harmless)
- Verify the TrueNAS CORE v2.0 API contract matches what `truenas.py` expects:
  - `GET /api/v2.0/disk` returns a list with `name`, `serial`, `model`, `size`, `temperature`
  - `GET /api/v2.0/core/get_jobs` with filter `[["method","=","smart.test"]]`
  - `POST /api/v2.0/smart/test` accepts `{disks: [devname], type: "SHORT"/"LONG"}`
- Check that disk names match the expected format (TrueNAS CORE uses `ada0`, `da0`, etc. — not `sda`); you may need to update the mock drive names or adjust poller logic
- Delete `app.db` to clear mock drive rows before the first real poll
Feature Reference (Stage 6b)
New Pages
| URL | Description |
|---|---|
| `/stats` | Analytics — pass rate by model, daily activity last 14 days |
| `/audit` | Audit log — last 200 events with drive/operator context |
| `/settings` | Editable 2-col settings form (SMTP, Notifications, Behavior, Webhook) |
| `/history/{id}/print` | Print-friendly job report with QR code |
New API Routes (6b + 6c)
| Method | Path | Description |
|---|---|---|
| PATCH | `/api/v1/drives/{id}` | Update notes and/or location |
| POST | `/api/v1/settings` | Save runtime settings to /data/settings_overrides.json |
| POST | `/api/v1/settings/test-smtp` | Test SMTP connection without sending email |
Notifications
- Browser push: bell icon in header → `Notification.requestPermission()`. Fires on `job-alert` SSE event (burn-in pass/fail).
- SSE alert event: `job-alert` event type on `/sse/drives`. JS listens via `htmx:sseMessage`.
- Immediate email: `send_job_alert()` in mailer.py, triggered by `notifier.notify_job_complete()` from burnin.py.
- Webhook: `notifier._send_webhook()` — POST JSON to `WEBHOOK_URL`. Payload includes event, job_id, devname, serial, model, state, operator, error_text.
Stuck Job Detection
- `burnin.check_stuck_jobs()` runs every 5 poll cycles (~1 min)
- Jobs running longer than `STUCK_JOB_HOURS` (default 24 h) → state=unknown
- Logged at CRITICAL level; audit event written
Batch Burn-In
- Checkboxes on each idle/selectable drive row
- Batch bar appears in filter row when any drives selected
- Uses existing `POST /api/v1/burnin/start` with multiple `drive_ids`
- Requires operator name + explicit confirmation checkbox (no serial required)
- JS `checkedDriveIds` Set persists across SSE swaps via `restoreCheckboxes()`
Drive Location
- `location` and `notes` fields added to drives table via ALTER TABLE migration
- Inline click-to-edit on location field in drive name cell
- Saves via `PATCH /api/v1/drives/{id}` on blur/Enter; restores on Escape
Feature Reference (Stage 6c)
Settings Page
- Two-column layout: SMTP card (left, wider) + Notifications / Behavior / Webhook stacked (right)
- Read-only system card at bottom (TrueNAS URL, poll interval, etc.) — restart required badge
- All changes save instantly via `POST /api/v1/settings` → `settings_store.save()` → `/data/settings_overrides.json`
- Overrides loaded on startup in `main.py` lifespan via `settings_store.init()`
- Connection mode dropdown auto-sets port: STARTTLS→587, SSL/TLS→465, Plain→25
- Test Connection button at top of SMTP card — tests live settings without sending email
- Brand logo in header is now a clickable `<a href="/">` home link
SMTP Port Derivation
```python
# mailer.py — port is derived from mode, NOT from settings.smtp_port
_MODE_PORTS = {"starttls": 587, "ssl": 465, "plain": 25}
port = _MODE_PORTS.get(mode, 587)
```
Never use `settings.smtp_port` in mailer — it's kept in config for .env backward compat only.
Burn-In Stage Selection
`StartBurninRequest` no longer takes `profile: str`. Instead it takes:
- `run_surface: bool = True` — Surface Validate (destructive write test)
- `run_short: bool = True` — Short SMART (non-destructive)
- `run_long: bool = True` — Long SMART (non-destructive)

The profile string is computed as a property. Profiles: full, surface_short, surface_long, surface, short_long, short, long. Precheck and final_check always run.
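The computed profile might look like this (a sketch reconstructed from the profile names listed above, written as a plain function rather than the real pydantic property):

```python
def profile_from_flags(run_surface: bool, run_short: bool, run_long: bool) -> str:
    """Map the three stage flags to one of the seven profile names."""
    if run_surface and run_short and run_long:
        return "full"
    parts = []
    if run_surface:
        parts.append("surface")
    if run_short:
        parts.append("short")
    if run_long:
        parts.append("long")
    return "_".join(parts)  # e.g. "surface_short", "short_long", "long"
```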
`STAGE_ORDER` in burnin.py has all 7 profile combinations.
`_recalculate_progress()` uses a `_STAGE_BASE_WEIGHTS` dict (per-stage weights) and computes the overall % dynamically from the actual `burnin_stages` rows — no profile lookup needed.
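Illustratively (the weight values and function signature here are assumptions; only the weighted-over-present-stages idea comes from the doc):

```python
# Hypothetical weights; the real _STAGE_BASE_WEIGHTS in burnin.py may differ.
_STAGE_BASE_WEIGHTS = {
    "precheck": 1, "short_smart": 5, "long_smart": 30,
    "surface_validate": 40, "io_validate": 10, "final_check": 1,
}

def recalc_progress(stage_rows: list[dict]) -> int:
    """Overall % from per-stage rows ({'stage_name': ..., 'percent': ...}).
    Weights are normalised over the stages actually present,
    so no profile lookup is needed."""
    total = sum(_STAGE_BASE_WEIGHTS.get(r["stage_name"], 1) for r in stage_rows)
    if total == 0:
        return 0
    done = sum(_STAGE_BASE_WEIGHTS.get(r["stage_name"], 1) * r["percent"] / 100
               for r in stage_rows)
    return round(done / total * 100)
```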
In the UI, both single-drive and batch modals show 3 checkboxes. If surface is unchecked:
- Destructive warning is hidden
- Serial confirmation field is hidden (single modal)
- Confirmation checkbox is hidden (batch modal)
Table Scroll Fix
```css
.table-wrap {
  max-height: calc(100vh - 205px); /* header(44) + main-pad(20) + stats-bar(70) + filter-bar(46) + buffer */
}
```
If stats bar or other content height changes, update this offset.
Feature Reference (Stage 6d)
Cancel Functionality
| What | How |
|---|---|
| Cancel running Short SMART | ✕ Short button appears in action col when short_busy; calls POST /api/v1/drives/{id}/smart/cancel with {type:"short"} |
| Cancel running Long SMART | ✕ Long button appears when long_busy; same route with {type:"long"} |
| Cancel individual burn-in | ✕ Burn-In button (was "Cancel") shown when bi_active; calls POST /api/v1/burnin/{id}/cancel |
| Cancel All Running | Red ✕ Cancel All Burn-Ins button appears in filter bar when any burn-in jobs are active; JS collects all .btn-cancel[data-job-id] and cancels each |
SMART cancel route (`POST /api/v1/drives/{drive_id}/smart/cancel`):
- Fetches all running TrueNAS jobs via `client.get_smart_jobs()`
- Finds the job where `arguments[0].disks` contains the drive's devname
- Calls `client.abort_job(tn_job_id)`
- Updates the `smart_tests` table row to `state='aborted'`
Stage Reordering
- Default order changed to: Short SMART → Long SMART → Surface Validate (non-destructive first)
- Drag handles (⠿) on each stage row in both single and batch modals
- HTML5 drag-and-drop, no external library
- `getStageOrder(listId)` reads the current DOM order of checked stages
- `stage_order: ["short_smart","long_smart","surface_validate"]` sent in the API body
- `StartBurninRequest.stage_order: list[str] | None` — validated against allowed stage names
- `burnin.start_job()` accepts a `stage_order` param; builds `["precheck"] + stage_order + ["final_check"]`
- `_run_job()` reads stage names back from `burnin_stages ORDER BY id` — so custom order is honoured
- Destructive warning / serial confirmation still triggered by the `stage-surface` checkbox ID (order-independent)
NPM / DNS Setup
- Proxy host: `burnin.hellocomputer.xyz` → `http://10.0.0.138:8080`
- Authelia protection: recommended (no built-in auth in the app)
- DNS: `burnin.hellocomputer.xyz` CNAME → `sandon.hellocomputer.xyz` (proxied: false)