truenas-burnin/CLAUDE.md
Brandon Walter fc33c0d11e docs: update CLAUDE.md for Stage 7; bump version to 1.0.0-7
Documents all Stage 7 features: SSH burn-in architecture, SMART attr
monitoring, drive reset, version badge, stats polish, new env vars,
new API routes, and real-TrueNAS cutover steps.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-24 08:13:21 -05:00


TrueNAS Burn-In Dashboard — Project Context

Drop this file in any new Claude session to resume work with full context. Last updated: 2026-02-24 (Stage 7)


What This Is

A self-hosted web dashboard for running and tracking hard-drive burn-in tests against a TrueNAS CORE instance. Deployed on maple.local (10.0.0.138).

Stages completed

| Stage | Description |
|-------|-------------|
| 1 | Mock TrueNAS CORE v2.0 API (15 drives, sda–sdo) |
| 2 | Backend core (FastAPI, SQLite/WAL, poller, TrueNAS client) |
| 3 | Dashboard UI (Jinja2, SSE live updates, dark theme) |
| 4 | Burn-in orchestrator (queue, concurrency, start/cancel) |
| 5 | History page, job detail page, CSV export |
| 6 | Hardening (retries, JSON logging, IP allowlist, poller watchdog) |
| 6b | UX overhaul (stats bar, alerts, batch, notifications, location, print, analytics) |
| 6c | Settings overhaul (editable form, runtime store, SMTP fix, stage selection) |
| 6d | Cancel SMART tests, Cancel All burn-ins, drag-to-reorder stages in modals |
| 7 | SSH burn-in execution, SMART attr monitoring, drive reset, version badge, stats polish |

File Map

truenas-burnin/
├── docker-compose.yml          # two services: mock-truenas + app
├── Dockerfile                  # app container
├── requirements.txt
├── .env.example
├── data/                       # SQLite DB lives here (gitignored, created on deploy)
│
├── mock-truenas/
│   ├── Dockerfile
│   └── app.py                  # FastAPI mock of TrueNAS CORE v2.0 REST API
│
└── app/
    ├── __init__.py
    ├── config.py               # pydantic-settings; reads .env
    ├── database.py             # schema, migrations, init_db(), get_db()
    ├── models.py               # Pydantic v2 models; StartBurninRequest has run_surface/run_short/run_long + profile property
    ├── settings_store.py       # runtime settings store — persists to /data/settings_overrides.json
    ├── ssh_client.py           # asyncssh client: smartctl parsing, badblocks streaming, test_connection
    ├── truenas.py              # httpx async client with retry (lambda factory pattern)
    ├── poller.py               # poll loop, SSE pub/sub, stale detection, stuck-job check
    ├── burnin.py               # orchestrator, semaphore, stages, check_stuck_jobs()
    ├── notifier.py             # webhook + immediate email alerts on job completion
    ├── mailer.py               # daily HTML email + per-job alert email
    ├── logging_config.py       # structured JSON logging
    ├── renderer.py             # Jinja2 + filters (format_bytes, format_eta, format_elapsed, …)
    ├── routes.py               # all FastAPI route handlers
    ├── main.py                 # app factory, IP allowlist middleware, lifespan
    │
    ├── static/
    │   ├── app.css             # full dark theme + mobile responsive
    │   └── app.js              # push notifications, batch, elapsed timers, inline edit
    │
    └── templates/
        ├── layout.html         # header nav: History, Stats, Audit, Settings, bell button
        ├── dashboard.html      # stats bar, failed banner, batch bar
        ├── history.html
        ├── job_detail.html     # + Print/Export button
        ├── audit.html          # audit event log
        ├── stats.html          # analytics: pass rate by model, daily activity, duration by size, failures by stage
        ├── settings.html       # editable 2-col form: SMTP + SSH (left) + Notifications/Behavior/Webhook/System (right)
        ├── job_print.html      # print view with client-side QR code (qrcodejs CDN)
        └── components/
            ├── drives_table.html   # checkboxes, elapsed time, location inline edit
            ├── modal_start.html    # single-drive burn-in modal
            └── modal_batch.html    # batch burn-in modal

Architecture Overview

Browser  ──HTMX SSE──▶  GET /sse/drives
                              │
                         poller.subscribe()
                              │
                         asyncio.Queue  ◀─── poller.run() notifies after each poll
                              │                    & after each burnin stage update
                         render drives_table.html
                         yield SSE "drives-update" event
  • Poller (poller.py): runs every POLL_INTERVAL_SECONDS (default 12s), calls TrueNAS /api/v2.0/disk and /api/v2.0/core/get_jobs, writes to SQLite, notifies SSE subscribers
  • Burn-in (burnin.py): asyncio.Semaphore(max_parallel_burnins) gates concurrency. Jobs are created immediately (queued state), semaphore gates actual execution. On startup, any interrupted running jobs → state=unknown; queued jobs are re-enqueued.
  • SSE (routes.py /sse/drives): one persistent connection per browser tab. Renders fresh drives_table.html HTML fragment on every notification.
  • HTMX (dashboard.html): hx-ext="sse" + sse-swap="drives-update" replaces #drives-tbody content without page reload.
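The poller-to-SSE fan-out can be sketched as a minimal asyncio pub/sub (illustrative names; the real poller.py also prunes dead subscribers on disconnect):

```python
import asyncio

class PubSub:
    """Minimal fan-out: each SSE connection gets its own queue."""
    def __init__(self) -> None:
        self._subscribers: set[asyncio.Queue] = set()

    def subscribe(self) -> asyncio.Queue:
        q: asyncio.Queue = asyncio.Queue()
        self._subscribers.add(q)
        return q

    def unsubscribe(self, q: asyncio.Queue) -> None:
        self._subscribers.discard(q)

    def notify(self, payload: str) -> None:
        # Called by the poller after each poll and each burn-in stage update.
        for q in self._subscribers:
            q.put_nowait(payload)

async def demo() -> str:
    bus = PubSub()
    q = bus.subscribe()
    bus.notify("drives-update")
    return await q.get()

print(asyncio.run(demo()))  # drives-update
```

Each SSE generator awaits its own queue, so a slow browser tab never blocks the poller or the other tabs.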

Database Schema (SQLite WAL mode)

-- drives: upsert by truenas_disk_id (the TrueNAS internal disk identifier)
drives (id, truenas_disk_id UNIQUE, devname, serial, model, size_bytes,
        temperature_c, smart_health, last_polled_at)

-- smart_tests: one row per drive+test_type combination (UNIQUE constraint)
smart_tests (id, drive_id FK, test_type CHECK('short','long'),
             state, percent, started_at, eta_at, finished_at, error_text,
             UNIQUE(drive_id, test_type))

-- burnin_jobs: one row per burn-in run (multiple per drive over time)
burnin_jobs (id, drive_id FK, profile, state CHECK(queued/running/passed/
             failed/cancelled/unknown), percent, stage_name, operator,
             created_at, started_at, finished_at, error_text)

-- burnin_stages: one row per stage per job
burnin_stages (id, burnin_job_id FK, stage_name, state, percent,
               started_at, finished_at, error_text,
               log_text TEXT,        -- raw smartctl/badblocks SSH output
               bad_blocks INTEGER)   -- bad sector count from surface_validate

-- audit_events: append-only log
audit_events (id, event_type, drive_id, job_id, operator, note, created_at)

-- drives columns added by migrations:
--   location TEXT, notes TEXT (Stage 6b)
--   smart_attrs TEXT            -- JSON blob of last SMART attribute snapshot (Stage 7)

-- smart_tests columns added by migrations:
--   raw_output TEXT             -- raw smartctl -a output (Stage 7)
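The migration-added columns above can be applied idempotently with a column-existence check. An illustrative sketch (not the project's actual database.py migration code):

```python
import sqlite3

def add_column_if_missing(conn: sqlite3.Connection, table: str, column: str, decl: str) -> None:
    """Idempotent ALTER TABLE: safe to run on every startup."""
    cols = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column not in cols:
        conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE drives (id INTEGER PRIMARY KEY)")
add_column_if_missing(conn, "drives", "smart_attrs", "TEXT")
add_column_if_missing(conn, "drives", "smart_attrs", "TEXT")  # second call is a no-op
print([row[1] for row in conn.execute("PRAGMA table_info(drives)")])  # ['id', 'smart_attrs']
```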

Burn-In Stage Definitions

```python
STAGE_ORDER = {
    "quick": ["precheck", "short_smart", "io_validate",  "final_check"],
    "full":  ["precheck", "surface_validate", "short_smart", "long_smart", "final_check"],
}
```

The UI originally exposed only the full profile (destructive); since Stage 6c, individual stages are selectable in the start modals. The quick profile exists for dev/testing only.


TrueNAS API Contracts Used

| Method | Endpoint | Notes |
|--------|----------|-------|
| GET | /api/v2.0/disk | List all disks |
| POST | /api/v2.0/smart/test | Start SMART test `{disks:[name], type:"SHORT"\|"LONG"}` |
| GET | /api/v2.0/core/get_jobs | Filter `[["method","=","smart.test"]]` |
| POST | /api/v2.0/core/job_abort | job_id positional arg |
| GET | /api/v2.0/smart/test/results/{disk} | Per-disk SMART results |

Auth: Authorization: Bearer {TRUENAS_API_KEY} header.


Config / Environment Variables

All read from .env via pydantic-settings. See .env.example for full list.

| Variable | Default | Notes |
|----------|---------|-------|
| APP_HOST | 0.0.0.0 | |
| APP_PORT | 8080 | |
| DB_PATH | /data/app.db | Inside container |
| TRUENAS_BASE_URL | http://localhost:8000 | Point at mock or real TrueNAS |
| TRUENAS_API_KEY | mock-key | Real API key for prod |
| TRUENAS_VERIFY_TLS | false | Set true for prod with valid cert |
| POLL_INTERVAL_SECONDS | 12 | |
| STALE_THRESHOLD_SECONDS | 45 | UI shows warning if data older than this |
| MAX_PARALLEL_BURNINS | 2 | asyncio.Semaphore limit |
| SURFACE_VALIDATE_SECONDS | 45 | Mock only — duration of surface stage |
| IO_VALIDATE_SECONDS | 25 | Mock only — duration of I/O stage |
| STUCK_JOB_HOURS | 24 | Hours before a running job is auto-marked unknown |
| LOG_LEVEL | INFO | |
| ALLOWED_IPS | (empty) | Empty = allow all. Comma-sep IPs/CIDRs |
| SMTP_HOST | (empty) | Empty = email disabled |
| SMTP_PORT | 587 | |
| SMTP_USER | (empty) | |
| SMTP_PASSWORD | (empty) | |
| SMTP_FROM | (empty) | |
| SMTP_TO | (empty) | Comma-separated |
| SMTP_REPORT_HOUR | 8 | Local hour (0-23) to send daily report |
| SMTP_ALERT_ON_FAIL | true | Immediate email when a job fails |
| SMTP_ALERT_ON_PASS | false | Immediate email when a job passes |
| WEBHOOK_URL | (empty) | POST JSON on burnin_passed/burnin_failed. Works with ntfy, Slack, Discord, n8n |
| TEMP_WARN_C | 46 | Temperature warning threshold (°C) |
| TEMP_CRIT_C | 55 | Temperature critical threshold — precheck fails above this |
| BAD_BLOCK_THRESHOLD | 0 | Max bad blocks allowed before surface_validate fails (0 = any bad = fail) |
| APP_VERSION | 1.0.0-7 | Displayed in header version badge |
| SSH_HOST | (empty) | TrueNAS SSH hostname/IP — empty disables SSH mode (uses mock/REST) |
| SSH_PORT | 22 | TrueNAS SSH port |
| SSH_USER | root | TrueNAS SSH username |
| SSH_PASSWORD | (empty) | TrueNAS SSH password (use key instead for production) |
| SSH_KEY | (empty) | TrueNAS SSH private key PEM string — loaded in-memory, never written to disk |

Deploy Workflow

First deploy (already done)

```shell
# On maple.local
cd ~/docker/stacks/truenas-burnin
docker compose up -d --build
```

Redeploy after code changes

```shell
# Copy changed files from mac to maple.local first, e.g.:
scp -P 2225 -r app/ brandon@10.0.0.138:~/docker/stacks/truenas-burnin/

# Then on maple.local:
ssh brandon@10.0.0.138 -p 2225
cd ~/docker/stacks/truenas-burnin
docker compose up -d --build
```

Reset the database (e.g. after schema changes)

```shell
# On maple.local — stop containers first
docker compose stop app
# Delete DB using alpine (container owns the file, sudo not available)
docker run --rm -v ~/docker/stacks/truenas-burnin/data:/data alpine rm -f /data/app.db
docker compose start app
```

Check logs

```shell
docker compose logs -f app
docker compose logs -f mock-truenas
```

Mock TrueNAS Server (mock-truenas/app.py)

  • 15 drives: sda–sdo
  • Drive mix: 3× ST12000NM0008 12TB, 3× WD80EFAX 8TB, 2× ST16000NM001G 16TB, 2× ST4000VN008 4TB, 2× TOSHIBA MG06ACA10TE 10TB, 1× HGST HUS728T8TAL5200 8TB, 1× Seagate Barracuda ST6000DM003 6TB, 1× FAIL001 (sdn) — always fails at ~30%
  • SHORT test: 90s simulated; LONG test: 480s simulated; tick every 5s
  • Debug endpoints:
    • POST /debug/reset — reset all jobs/state
    • GET /debug/state — dump current state
    • POST /debug/complete-all-jobs — instantly complete all running tests

Key Implementation Patterns

Retry pattern — lambda factory (NOT coroutine object)

```python
# CORRECT: pass a factory so each retry creates a fresh coroutine
r = await _with_retry(lambda: self._client.get("/api/v2.0/disk"), "get_disks")

# WRONG: coroutine is exhausted after first await, retry silently fails
r = await _with_retry(self._client.get("/api/v2.0/disk"), "get_disks")
```

SSE template rendering

```python
# Use templates.env.get_template().render() — not TemplateResponse (that's a Response object)
html = templates.env.get_template("components/drives_table.html").render(drives=drives)
yield {"event": "drives-update", "data": html}
```

Sticky thead scroll fix

```css
/* BOTH axes required on table-wrap for position:sticky to work on thead */
.table-wrap {
  overflow: auto;           /* NOT overflow-x: auto */
  max-height: calc(100vh - 130px);
}
thead { position: sticky; top: 0; z-index: 10; }
```

export.csv route ordering

```python
# MUST register export.csv BEFORE /{job_id} — FastAPI tries int() on "export.csv"
@router.get("/api/v1/burnin/export.csv")   # first
async def burnin_export_csv(...): ...

@router.get("/api/v1/burnin/{job_id}")     # second
async def burnin_get(job_id: int, ...): ...
```

Known Issues / Past Bugs Fixed

| Bug | Root Cause | Fix |
|-----|------------|-----|
| _execute_stages used STAGE_ORDER[profile], ignoring custom order | Stage order stored in DB but not read back | _run_job reads stages from burnin_stages ORDER BY id; _execute_stages accepts stages: list[str] |
| Poller stuck at 'running' after completion | _sync_history() had early-return guard when state=running | Removed guard — _sync_history only called when job not in active dict |
| DB schema tables missing after edit | Tables split into separate variable never passed to executescript() | Put all tables in single SCHEMA string |
| Retry not retrying | _with_retry(coro) — coroutine exhausted after first fail | Changed to _with_retry(factory: Callable[[], Coroutine]) |
| error_text overwritten | _finish_stage(success=False) overwrote error set by stage handler | _finish_stage omits error_text column in SQL when param is None |
| Cancelled stage showed 'failed' | _execute_stages called _finish_stage(success=False) on cancel | Check _is_cancelled(), call _cancel_stage() instead |
| export.csv returns 422 | Route registered after /{job_id}, FastAPI tries int("export.csv") | Move export route before parameterized route |
| Old drive names persist after mock rename | Poller upserts by truenas_disk_id, old rows stay | Delete app.db and restart |
| First row clipped behind sticky thead | overflow-x: auto only creates partial stacking context | Use overflow: auto (both axes) on .table-wrap |
| rm data/app.db permission denied | Container owns the file | Use docker run --rm -v .../data:/data alpine rm -f /data/app.db |
| First row clipped after Stage 6b | Stats bar added 70px but max-height not updated | max-height: calc(100vh - 205px) |
| SMTP "Connection unexpectedly closed" | _send_email used settings.smtp_port (587 default) even in SSL mode | Derive port from mode via _MODE_PORTS dict; SSL→465, STARTTLS→587, Plain→25 |
| SSL mode missing EHLO | smtplib.SMTP_SSL was created without calling ehlo() | Added server.ehlo() after both SSL and STARTTLS connections |
| profile NameError in _execute_stages | _execute_stages called _recalculate_progress(job_id, profile) but profile not in scope | Changed to _recalculate_progress(job_id) — profile param was unused |
| app_version Jinja2 global rendered as function | templates.env.globals["app_version"] was set to _get_app_version (a callable) | Set to the static string value directly: = _settings.app_version |

Feature Reference (Stage 7)

SSH Burn-In Architecture

ssh_client.py provides an optional SSH execution layer. When SSH_HOST is set (and key or password is present), all burn-in stages run real commands over SSH against TrueNAS. When SSH_HOST is empty, stages fall back to mock/REST simulation.

Dual-mode dispatch — each stage checks ssh_client.is_configured():

```python
if ssh_client.is_configured():
    ...  # run smartctl / badblocks over SSH
else:
    ...  # simulate with REST API or timed sleep (mock mode)
```

SSH client capabilities (ssh_client.py):

  • test_connection() → {"ok": bool, "error": str} — used by Test SSH button
  • get_smart_attributes(devname) → parse smartctl -a, return {health, raw_output, attributes, warnings, failures}
  • start_smart_test(devname, test_type) → smartctl -t short|long /dev/{devname}
  • poll_smart_progress(devname) → smartctl -a during test; returns {state, percent_remaining, output}
  • abort_smart_test(devname) → smartctl -X /dev/{devname}
  • run_badblocks(devname, on_progress, cancelled_fn) → streams badblocks -wsv -b 4096 -p 1; counts bad sectors from stdout (digit-only lines)

Key auth pattern — key is stored as PEM string in settings, never written to disk:

```python
asyncssh.connect(host, ..., client_keys=[asyncssh.import_private_key(pem_str)], known_hosts=None)
```

badblocks streaming — uses asyncssh.create_process() with parallel stdout/stderr draining via asyncio.gather. Progress updates written to DB every 20 lines to avoid excessive writes.
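The bad-sector counting from the streamed output can be sketched as a pure function (illustrative; `count_bad_blocks` is a hypothetical name, and the real run_badblocks counts incrementally while draining the stream):

```python
def count_bad_blocks(lines: list[str]) -> int:
    """badblocks prints each bad block number on its own line; progress and
    status lines always contain non-digit text, so filter on digit-only lines."""
    return sum(1 for line in lines if line.strip().isdigit())

sample = [
    "Testing with pattern 0xaa: done",
    "Reading and comparing:  12.5% done, 0:42 elapsed",
    "1048576",
    "1048577",
]
print(count_bad_blocks(sample))  # 2
```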

SMART Attribute Monitoring

Monitored attributes and their thresholds:

| ID | Name | Any non-zero → |
|----|------|----------------|
| 5 | Reallocated_Sector_Ct | FAIL |
| 10 | Spin_Retry_Count | WARN |
| 188 | Command_Timeout | WARN |
| 197 | Current_Pending_Sector | FAIL |
| 198 | Offline_Uncorrectable | FAIL |
| 199 | UDMA_CRC_Error_Count | WARN |

SMART attrs stored as JSON blob in drives.smart_attrs. Updated by final_check stage (SSH mode) or short_smart/long_smart REST mode. Displayed in drive drawer with colour-coded table + raw smartctl -a output.
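The threshold table above reduces to a small classifier. A sketch (function name is illustrative, not the project's actual helper):

```python
_FAIL_IDS = {5, 197, 198}   # reallocated / pending / offline-uncorrectable
_WARN_IDS = {10, 188, 199}  # spin retry / command timeout / UDMA CRC

def classify_attrs(attrs: dict[int, int]) -> str:
    """Return 'fail', 'warn', or 'ok' given {attr_id: raw_value}.
    Any non-zero raw value in a FAIL attribute trumps warnings."""
    if any(attrs.get(i, 0) > 0 for i in _FAIL_IDS):
        return "fail"
    if any(attrs.get(i, 0) > 0 for i in _WARN_IDS):
        return "warn"
    return "ok"

print(classify_attrs({5: 0, 199: 3}))  # warn
```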

Drive Reset Action

  • POST /api/v1/drives/{drive_id}/reset — clears smart_tests rows to idle, clears drives.smart_attrs, writes audit event, notifies SSE subscribers
  • Button appears in action column when can_reset = drive has no active burn-in AND has any non-idle smart state or smart attrs
  • Burn-in history (burnin_jobs, burnin_stages) is preserved — reset only affects SMART test state

New Routes (Stage 7)

| Method | Path | Description |
|--------|------|-------------|
| POST | /api/v1/drives/{id}/reset | Reset SMART state and attrs for a drive |
| POST | /api/v1/settings/test-ssh | Test SSH connection with current SSH settings |
| GET | /api/v1/updates/check | Check for latest release from Forgejo git.hellocomputer.xyz |

Check for Updates

Settings page has a "Check for Updates" button that fetches:

GET https://git.hellocomputer.xyz/api/v1/repos/brandon/truenas-burnin/releases/latest

Compares tag name against settings.app_version; shows "up to date" or "v{tag} available".

Version Badge

app_version set as Jinja2 global in renderer.py:

```python
templates.env.globals["app_version"] = _settings.app_version
```

Displayed in header as <span class="header-version">v{app_version}</span> (right side, muted).

Configurable Thresholds

renderer.py _temp_class now reads from settings instead of hardcoded values:

```python
if temp >= settings.temp_crit_c:  return "temp-crit"
if temp >= settings.temp_warn_c:  return "temp-warn"
```

precheck stage fails if temperature_c >= settings.temp_crit_c.

Surface validate fails if bad_blocks > settings.bad_block_threshold (default 0 = any bad sector = fail).

Cutting to Real TrueNAS (Next Steps)

When ready to test against a real TrueNAS CORE box:

  1. In Settings (or .env), set:
    • TrueNAS URL → https://10.0.0.X (real IP)
    • API Key → real API key
    • SSH Host → same IP as TrueNAS
    • SSH User → root (or sudoer with smartctl/badblocks access)
    • SSH Key → paste PEM key into textarea
  2. Click Test SSH Connection to verify before starting a burn-in
  3. TrueNAS CORE uses ada0, da0 device names (not sda). Mock drive names will differ.
  4. Delete app.db before first real poll to clear mock drive rows
  5. Comment out mock-truenas service in docker-compose.yml (optional — harmless to leave)
  6. Verify TrueNAS CORE v2.0 REST API:
    • GET /api/v2.0/disk returns list with name, serial, model, size, temperature
    • GET /api/v2.0/core/get_jobs with filter [["method","=","smart.test"]]
    • POST /api/v2.0/smart/test accepts {disks: [devname], type: "SHORT"|"LONG"}

Feature Reference (Stage 6b)

New Pages

| URL | Description |
|-----|-------------|
| /stats | Analytics — pass rate by model, daily activity last 14 days |
| /audit | Audit log — last 200 events with drive/operator context |
| /settings | Editable 2-col settings form (SMTP, Notifications, Behavior, Webhook) |
| /history/{id}/print | Print-friendly job report with QR code |

New API Routes (6b + 6c)

| Method | Path | Description |
|--------|------|-------------|
| PATCH | /api/v1/drives/{id} | Update notes and/or location |
| POST | /api/v1/settings | Save runtime settings to /data/settings_overrides.json |
| POST | /api/v1/settings/test-smtp | Test SMTP connection without sending email |

Notifications

  • Browser push: Bell icon in header → Notification.requestPermission(). Fires on job-alert SSE event (burnin pass/fail).
  • SSE alert event: job-alert event type on /sse/drives. JS listens via htmx:sseMessage.
  • Immediate email: send_job_alert() in mailer.py. Triggered by notifier.notify_job_complete() from burnin.py.
  • Webhook: notifier._send_webhook() — POST JSON to WEBHOOK_URL. Payload includes event, job_id, devname, serial, model, state, operator, error_text.
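An example webhook payload built from the field list above (all values illustrative):

```python
import json

# Illustrative burnin_failed payload — field set per notifier._send_webhook above;
# the concrete values here are made up for demonstration.
payload = {
    "event": "burnin_failed",
    "job_id": 42,
    "devname": "sdn",
    "serial": "FAIL001",
    "model": "EXAMPLE-MODEL",
    "state": "failed",
    "operator": "brandon",
    "error_text": "surface_validate: bad blocks over threshold",
}
print(json.dumps(payload, indent=2))
```

ntfy, Slack, Discord, and n8n each accept an arbitrary JSON POST body, which is why one payload shape serves all four.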

Stuck Job Detection

  • burnin.check_stuck_jobs() runs every 5 poll cycles (~1 min)
  • Jobs running longer than STUCK_JOB_HOURS (default 24h) → state=unknown
  • Logged at CRITICAL level; audit event written
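The stuck-job sweep reduces to one UPDATE. A sketch (epoch-second timestamps for brevity — an assumption; the real check_stuck_jobs() also logs and writes an audit event):

```python
import sqlite3, time

STUCK_JOB_HOURS = 24  # from settings

def check_stuck_jobs(conn: sqlite3.Connection, now: float) -> int:
    """Mark burn-in jobs running longer than STUCK_JOB_HOURS as state='unknown'.
    Returns the number of jobs flipped."""
    cutoff = now - STUCK_JOB_HOURS * 3600
    cur = conn.execute(
        "UPDATE burnin_jobs SET state='unknown' WHERE state='running' AND started_at < ?",
        (cutoff,),
    )
    return cur.rowcount

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE burnin_jobs (id INTEGER PRIMARY KEY, state TEXT, started_at REAL)")
now = time.time()
conn.execute("INSERT INTO burnin_jobs (state, started_at) VALUES ('running', ?)", (now - 25 * 3600,))
conn.execute("INSERT INTO burnin_jobs (state, started_at) VALUES ('running', ?)", (now - 1 * 3600,))
print(check_stuck_jobs(conn, now))  # 1
```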

Batch Burn-In

  • Checkboxes on each idle/selectable drive row
  • Batch bar appears in filter row when any drives selected
  • Uses existing POST /api/v1/burnin/start with multiple drive_ids
  • Requires operator name + explicit confirmation checkbox (no serial required)
  • JS checkedDriveIds Set persists across SSE swaps via restoreCheckboxes()

Drive Location

  • location and notes fields added to drives table via ALTER TABLE migration
  • Inline click-to-edit on location field in drive name cell
  • Saves via PATCH /api/v1/drives/{id} on blur/Enter; restores on Escape

Feature Reference (Stage 6c)

Settings Page

  • Two-column layout: SMTP card (left, wider) + Notifications / Behavior / Webhook stacked (right)
  • Read-only system card at bottom (TrueNAS URL, poll interval, etc.) — restart required badge
  • All changes save instantly via POST /api/v1/settings → settings_store.save() → /data/settings_overrides.json
  • Overrides loaded on startup in main.py lifespan via settings_store.init()
  • Connection mode dropdown auto-sets port: STARTTLS→587, SSL/TLS→465, Plain→25
  • Test Connection button at top of SMTP card — tests live settings without sending email
  • Brand logo in header is now a clickable <a href="/"> home link

SMTP Port Derivation

```python
# mailer.py — port is derived from mode, NOT from settings.smtp_port
_MODE_PORTS = {"starttls": 587, "ssl": 465, "plain": 25}
port = _MODE_PORTS.get(mode, 587)
```

Never use settings.smtp_port in mailer — it's kept in config for .env backward compat only.

Burn-In Stage Selection

StartBurninRequest no longer takes profile: str. Instead takes:

  • run_surface: bool = True — surface validate (destructive write test)
  • run_short: bool = True — Short SMART (non-destructive)
  • run_long: bool = True — Long SMART (non-destructive)

Profile string is computed as a property. Profiles: full, surface_short, surface_long, surface, short_long, short, long. Precheck and final_check always run.

STAGE_ORDER in burnin.py has all 7 profile combinations.

_recalculate_progress() uses _STAGE_BASE_WEIGHTS dict (per-stage weights) and computes overall % dynamically from actual burnin_stages rows — no profile lookup needed.
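The profile names map cleanly onto the three booleans. A sketch of the computed-profile logic (a plain function here; the real version is a property on the Pydantic model, which presumably also rejects the all-False case):

```python
def profile(run_surface: bool, run_short: bool, run_long: bool) -> str:
    """Compute the profile string from the three stage checkboxes."""
    parts = []
    if run_surface:
        parts.append("surface")
    if run_short:
        parts.append("short")
    if run_long:
        parts.append("long")
    if len(parts) == 3:
        return "full"
    return "_".join(parts)  # e.g. surface_short, short_long, long

print(profile(True, True, True))    # full
print(profile(False, True, False))  # short
```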

In the UI, both single-drive and batch modals show 3 checkboxes. If surface is unchecked:

  • Destructive warning is hidden
  • Serial confirmation field is hidden (single modal)
  • Confirmation checkbox is hidden (batch modal)

Table Scroll Fix

```css
.table-wrap {
  max-height: calc(100vh - 205px); /* header(44) + main-pad(20) + stats-bar(70) + filter-bar(46) + buffer */
}
```

If stats bar or other content height changes, update this offset.

Feature Reference (Stage 6d)

Cancel Functionality

| What | How |
|------|-----|
| Cancel running Short SMART | ✕ Short button appears in action col when short_busy; calls POST /api/v1/drives/{id}/smart/cancel with {type:"short"} |
| Cancel running Long SMART | ✕ Long button appears when long_busy; same route with {type:"long"} |
| Cancel individual burn-in | ✕ Burn-In button (was "Cancel") shown when bi_active; calls POST /api/v1/burnin/{id}/cancel |
| Cancel All Running | Red ✕ Cancel All Burn-Ins button appears in filter bar when any burn-in jobs are active; JS collects all .btn-cancel[data-job-id] and cancels each |

SMART cancel route (POST /api/v1/drives/{drive_id}/smart/cancel):

  1. Fetches all running TrueNAS jobs via client.get_smart_jobs()
  2. Finds job where arguments[0].disks contains the drive's devname
  3. Calls client.abort_job(tn_job_id)
  4. Updates smart_tests table row to state='aborted'

Stage Reordering

  • Default order changed to: Short SMART → Long SMART → Surface Validate (non-destructive first)
  • Drag handles (⠿) on each stage row in both single and batch modals
  • HTML5 drag-and-drop, no external library
  • getStageOrder(listId) reads current DOM order of checked stages
  • stage_order: ["short_smart","long_smart","surface_validate"] sent in API body
  • StartBurninRequest.stage_order: list[str] | None — validated against allowed stage names
  • burnin.start_job() accepts stage_order param; builds: ["precheck"] + stage_order + ["final_check"]
  • _run_job() reads stage names back from burnin_stages ORDER BY id — so custom order is honoured
  • Destructive warning / serial confirmation still triggered by stage-surface checkbox ID (order-independent)

NPM / DNS Setup

  • Proxy host: burnin.hellocomputer.xyz → http://10.0.0.138:8080
  • Authelia protection: recommended (no built-in auth in app)
  • DNS: burnin.hellocomputer.xyz CNAME → sandon.hellocomputer.xyz (proxied: false)