Documents all Stage 7 features: SSH burn-in architecture, SMART attr monitoring, drive reset, version badge, stats polish, new env vars, new API routes, and real-TrueNAS cutover steps. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# TrueNAS Burn-In Dashboard — Project Context

> Drop this file in any new Claude session to resume work with full context.
> Last updated: 2026-02-24 (Stage 7)

---
## What This Is

A self-hosted web dashboard for running and tracking hard-drive burn-in tests
against a TrueNAS CORE instance. Deployed on **maple.local** (10.0.0.138).

- **App URL**: http://10.0.0.138:8084 (or http://burnin.hellocomputer.xyz)
- **Stack path on maple.local**: `~/docker/stacks/truenas-burnin/`
- **Source (local mac)**: `~/Desktop/claude-sandbox/truenas-burnin/`
- **Compose synced to maple.local** via `scp` or manual copy

### Stages completed

| Stage | Description | Status |
|-------|-------------|--------|
| 1 | Mock TrueNAS CORE v2.0 API (15 drives, sda–sdo) | ✅ |
| 2 | Backend core (FastAPI, SQLite/WAL, poller, TrueNAS client) | ✅ |
| 3 | Dashboard UI (Jinja2, SSE live updates, dark theme) | ✅ |
| 4 | Burn-in orchestrator (queue, concurrency, start/cancel) | ✅ |
| 5 | History page, job detail page, CSV export | ✅ |
| 6 | Hardening (retries, JSON logging, IP allowlist, poller watchdog) | ✅ |
| 6b | UX overhaul (stats bar, alerts, batch, notifications, location, print, analytics) | ✅ |
| 6c | Settings overhaul (editable form, runtime store, SMTP fix, stage selection) | ✅ |
| 6d | Cancel SMART tests, Cancel All burn-ins, drag-to-reorder stages in modals | ✅ |
| 7 | SSH burn-in execution, SMART attr monitoring, drive reset, version badge, stats polish | ✅ |

---

## File Map

```
truenas-burnin/
├── docker-compose.yml      # two services: mock-truenas + app
├── Dockerfile              # app container
├── requirements.txt
├── .env.example
├── data/                   # SQLite DB lives here (gitignored, created on deploy)
│
├── mock-truenas/
│   ├── Dockerfile
│   └── app.py              # FastAPI mock of TrueNAS CORE v2.0 REST API
│
└── app/
    ├── __init__.py
    ├── config.py           # pydantic-settings; reads .env
    ├── database.py         # schema, migrations, init_db(), get_db()
    ├── models.py           # Pydantic v2 models; StartBurninRequest has run_surface/run_short/run_long + profile property
    ├── settings_store.py   # runtime settings store — persists to /data/settings_overrides.json
    ├── ssh_client.py       # asyncssh client: smartctl parsing, badblocks streaming, test_connection
    ├── truenas.py          # httpx async client with retry (lambda factory pattern)
    ├── poller.py           # poll loop, SSE pub/sub, stale detection, stuck-job check
    ├── burnin.py           # orchestrator, semaphore, stages, check_stuck_jobs()
    ├── notifier.py         # webhook + immediate email alerts on job completion
    ├── mailer.py           # daily HTML email + per-job alert email
    ├── logging_config.py   # structured JSON logging
    ├── renderer.py         # Jinja2 + filters (format_bytes, format_eta, format_elapsed, …)
    ├── routes.py           # all FastAPI route handlers
    ├── main.py             # app factory, IP allowlist middleware, lifespan
    │
    ├── static/
    │   ├── app.css         # full dark theme + mobile responsive
    │   └── app.js          # push notifications, batch, elapsed timers, inline edit
    │
    └── templates/
        ├── layout.html     # header nav: History, Stats, Audit, Settings, bell button
        ├── dashboard.html  # stats bar, failed banner, batch bar
        ├── history.html
        ├── job_detail.html # + Print/Export button
        ├── audit.html      # audit event log
        ├── stats.html      # analytics: pass rate by model, daily activity, duration by size, failures by stage
        ├── settings.html   # editable 2-col form: SMTP + SSH (left) + Notifications/Behavior/Webhook/System (right)
        ├── job_print.html  # print view with client-side QR code (qrcodejs CDN)
        └── components/
            ├── drives_table.html  # checkboxes, elapsed time, location inline edit
            ├── modal_start.html   # single-drive burn-in modal
            └── modal_batch.html   # batch burn-in modal
```

---

## Architecture Overview

```
Browser ──HTMX SSE──▶ GET /sse/drives
                           │
                     poller.subscribe()
                           │
                     asyncio.Queue ◀─── poller.run() notifies after each poll
                           │                         & after each burnin stage update
                     render drives_table.html
                     yield SSE "drives-update" event
```

- **Poller** (`poller.py`): runs every `POLL_INTERVAL_SECONDS` (default 12s), calls
  TrueNAS `/api/v2.0/disk` and `/api/v2.0/core/get_jobs`, writes to SQLite,
  notifies SSE subscribers
- **Burn-in** (`burnin.py`): `asyncio.Semaphore(max_parallel_burnins)` gates
  concurrency. Jobs are created immediately (queued state), the semaphore gates
  actual execution. On startup, any interrupted running jobs → state=unknown;
  queued jobs are re-enqueued.
- **SSE** (`routes.py /sse/drives`): one persistent connection per browser tab.
  Renders a fresh `drives_table.html` HTML fragment on every notification.
- **HTMX** (`dashboard.html`): `hx-ext="sse"` + `sse-swap="drives-update"`
  replaces `#drives-tbody` content without page reload.
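
The subscribe/notify flow above can be sketched with plain asyncio primitives. This is a hypothetical stand-in for `poller.py`, not its actual code: each SSE connection gets its own queue, and one `notify()` fans out to all of them.

```python
import asyncio

class PubSub:
    """Minimal sketch of the poller's SSE fan-out: each browser tab
    subscribes and gets its own asyncio.Queue; notify() pushes to all."""

    def __init__(self) -> None:
        self._subscribers: set[asyncio.Queue] = set()

    def subscribe(self) -> asyncio.Queue:
        q: asyncio.Queue = asyncio.Queue()
        self._subscribers.add(q)
        return q

    def unsubscribe(self, q: asyncio.Queue) -> None:
        self._subscribers.discard(q)

    def notify(self, event: str = "drives-update") -> None:
        # Called after each poll and after each burn-in stage update
        for q in self._subscribers:
            q.put_nowait(event)

async def demo() -> list[str]:
    ps = PubSub()
    q1, q2 = ps.subscribe(), ps.subscribe()
    ps.notify()
    # both subscribers receive the same event independently
    return [await q1.get(), await q2.get()]

print(asyncio.run(demo()))
```

Per-subscriber queues are what let a slow SSE consumer lag without blocking the poll loop itself.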

---

## Database Schema (SQLite WAL mode)

```sql
-- drives: upsert by truenas_disk_id (the TrueNAS internal disk identifier)
drives (id, truenas_disk_id UNIQUE, devname, serial, model, size_bytes,
        temperature_c, smart_health, last_polled_at)

-- smart_tests: one row per drive+test_type combination (UNIQUE constraint)
smart_tests (id, drive_id FK, test_type CHECK('short','long'),
             state, percent, started_at, eta_at, finished_at, error_text,
             UNIQUE(drive_id, test_type))

-- burnin_jobs: one row per burn-in run (multiple per drive over time)
burnin_jobs (id, drive_id FK, profile, state CHECK(queued/running/passed/
             failed/cancelled/unknown), percent, stage_name, operator,
             created_at, started_at, finished_at, error_text)

-- burnin_stages: one row per stage per job
burnin_stages (id, burnin_job_id FK, stage_name, state, percent,
               started_at, finished_at, error_text,
               log_text TEXT,       -- raw smartctl/badblocks SSH output
               bad_blocks INTEGER)  -- bad sector count from surface_validate

-- audit_events: append-only log
audit_events (id, event_type, drive_id, job_id, operator, note, created_at)

-- drives columns added by migrations:
--   location TEXT, notes TEXT  (Stage 6b)
--   smart_attrs TEXT           -- JSON blob of last SMART attribute snapshot (Stage 7)

-- smart_tests columns added by migrations:
--   raw_output TEXT            -- raw smartctl -a output (Stage 7)
```

---

## Burn-In Stage Definitions

```python
STAGE_ORDER = {
    "quick": ["precheck", "short_smart", "io_validate", "final_check"],
    "full": ["precheck", "surface_validate", "short_smart", "long_smart", "final_check"],
}
```

The UI only exposes the **full** profile (destructive). The quick profile exists for dev/testing.

---

## TrueNAS API Contracts Used

| Method | Endpoint | Notes |
|--------|----------|-------|
| GET | `/api/v2.0/disk` | List all disks |
| POST | `/api/v2.0/smart/test` | Start SMART test `{disks:[name], type:"SHORT"\|"LONG"}` |
| GET | `/api/v2.0/core/get_jobs` | Filter `[["method","=","smart.test"]]` |
| POST | `/api/v2.0/core/job_abort` | `job_id` positional arg |
| GET | `/api/v2.0/smart/test/results/{disk}` | Per-disk SMART results |

Auth: `Authorization: Bearer {TRUENAS_API_KEY}` header.
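
A small illustration of these contracts as request-building helpers. The helper names are hypothetical, and the exact wire encoding TrueNAS expects for the `get_jobs` filter is an assumption here; only the header shape, the filter expression, and the `smart/test` body come from the table above.

```python
import json

def truenas_headers(api_key: str) -> dict[str, str]:
    # Bearer-token header sent on every request
    return {"Authorization": f"Bearer {api_key}"}

def smart_test_body(devname: str, test_type: str) -> dict:
    # Body for POST /api/v2.0/smart/test; type must be upper-case SHORT | LONG
    return {"disks": [devname], "type": test_type.upper()}

def smart_jobs_filter() -> str:
    # Filter expression used against /api/v2.0/core/get_jobs
    return json.dumps([["method", "=", "smart.test"]])
```

In the app these pieces are assembled by the httpx client in `truenas.py`.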

---

## Config / Environment Variables

All read from `.env` via `pydantic-settings`. See `.env.example` for the full list.

| Variable | Default | Notes |
|----------|---------|-------|
| `APP_HOST` | `0.0.0.0` | |
| `APP_PORT` | `8080` | |
| `DB_PATH` | `/data/app.db` | Inside container |
| `TRUENAS_BASE_URL` | `http://localhost:8000` | Point at mock or real TrueNAS |
| `TRUENAS_API_KEY` | `mock-key` | Real API key for prod |
| `TRUENAS_VERIFY_TLS` | `false` | Set true for prod with valid cert |
| `POLL_INTERVAL_SECONDS` | `12` | |
| `STALE_THRESHOLD_SECONDS` | `45` | UI shows warning if data older than this |
| `MAX_PARALLEL_BURNINS` | `2` | asyncio.Semaphore limit |
| `SURFACE_VALIDATE_SECONDS` | `45` | Mock only — duration of surface stage |
| `IO_VALIDATE_SECONDS` | `25` | Mock only — duration of I/O stage |
| `STUCK_JOB_HOURS` | `24` | Hours before a running job is auto-marked unknown |
| `LOG_LEVEL` | `INFO` | |
| `ALLOWED_IPS` | `` | Empty = allow all. Comma-sep IPs/CIDRs |
| `SMTP_HOST` | `` | Empty = email disabled |
| `SMTP_PORT` | `587` | |
| `SMTP_USER` | `` | |
| `SMTP_PASSWORD` | `` | |
| `SMTP_FROM` | `` | |
| `SMTP_TO` | `` | Comma-separated |
| `SMTP_REPORT_HOUR` | `8` | Local hour (0–23) to send daily report |
| `SMTP_ALERT_ON_FAIL` | `true` | Immediate email when a job fails |
| `SMTP_ALERT_ON_PASS` | `false` | Immediate email when a job passes |
| `WEBHOOK_URL` | `` | POST JSON on burnin_passed/burnin_failed. Works with ntfy, Slack, Discord, n8n |
| `TEMP_WARN_C` | `46` | Temperature warning threshold (°C) |
| `TEMP_CRIT_C` | `55` | Temperature critical threshold — precheck fails above this |
| `BAD_BLOCK_THRESHOLD` | `0` | Max bad blocks allowed before surface_validate fails (0 = any bad block = fail) |
| `APP_VERSION` | `1.0.0-7` | Displayed in header version badge |
| `SSH_HOST` | `` | TrueNAS SSH hostname/IP — empty disables SSH mode (uses mock/REST) |
| `SSH_PORT` | `22` | TrueNAS SSH port |
| `SSH_USER` | `root` | TrueNAS SSH username |
| `SSH_PASSWORD` | `` | TrueNAS SSH password (use a key instead for production) |
| `SSH_KEY` | `` | TrueNAS SSH private key PEM string — loaded in-memory, never written to disk |

---

## Deploy Workflow

### First deploy (already done)

```bash
# On maple.local
cd ~/docker/stacks/truenas-burnin
docker compose up -d --build
```

### Redeploy after code changes

```bash
# Copy changed files from the mac to maple.local first, e.g.:
scp -P 2225 -r app/ brandon@10.0.0.138:~/docker/stacks/truenas-burnin/

# Then on maple.local:
ssh brandon@10.0.0.138 -p 2225
cd ~/docker/stacks/truenas-burnin
docker compose up -d --build
```

### Reset the database (e.g. after schema changes)

```bash
# On maple.local — stop the app container first
docker compose stop app
# Delete the DB using alpine (the container owns the file, sudo not available)
docker run --rm -v ~/docker/stacks/truenas-burnin/data:/data alpine rm -f /data/app.db
docker compose start app
```

### Check logs

```bash
docker compose logs -f app
docker compose logs -f mock-truenas
```

---

## Mock TrueNAS Server (`mock-truenas/app.py`)

- 15 drives: `sda`–`sdo`
- Drive mix: 3× ST12000NM0008 12TB, 3× WD80EFAX 8TB, 2× ST16000NM001G 16TB,
  2× ST4000VN008 4TB, 2× TOSHIBA MG06ACA10TE 10TB, 1× HGST HUS728T8TAL5200 8TB,
  1× Seagate Barracuda ST6000DM003 6TB, 1× **FAIL001** (sdn) — always fails at ~30%
- SHORT test: 90s simulated; LONG test: 480s simulated; tick every 5s
- Debug endpoints:
  - `POST /debug/reset` — reset all jobs/state
  - `GET /debug/state` — dump current state
  - `POST /debug/complete-all-jobs` — instantly complete all running tests

---

## Key Implementation Patterns

### Retry pattern — lambda factory (NOT a coroutine object)

```python
# CORRECT: pass a factory so each retry creates a fresh coroutine
r = await _with_retry(lambda: self._client.get("/api/v2.0/disk"), "get_disks")

# WRONG: the coroutine is exhausted after the first await, retry silently fails
r = await _with_retry(self._client.get("/api/v2.0/disk"), "get_disks")
```

### SSE template rendering

```python
# Use templates.env.get_template().render() — not TemplateResponse (that's a Response object)
html = templates.env.get_template("components/drives_table.html").render(drives=drives)
yield {"event": "drives-update", "data": html}
```

### Sticky thead scroll fix

```css
/* BOTH axes required on .table-wrap for position:sticky to work on thead */
.table-wrap {
  overflow: auto;  /* NOT overflow-x: auto */
  max-height: calc(100vh - 130px);
}
thead { position: sticky; top: 0; z-index: 10; }
```

### export.csv route ordering

```python
# MUST register export.csv BEFORE /{job_id} — FastAPI tries int() on "export.csv"
@router.get("/api/v1/burnin/export.csv")  # first
async def burnin_export_csv(...): ...

@router.get("/api/v1/burnin/{job_id}")    # second
async def burnin_get(job_id: int, ...): ...
```

---

## Known Issues / Past Bugs Fixed

| Bug | Root Cause | Fix |
|-----|-----------|-----|
| `_execute_stages` used `STAGE_ORDER[profile]`, ignoring custom order | Stage order stored in DB but not read back | `_run_job` reads stages from `burnin_stages ORDER BY id`; `_execute_stages` accepts `stages: list[str]` |
| Poller stuck at 'running' after completion | `_sync_history()` had an early-return guard when state=running | Removed the guard — `_sync_history` only called when job not in active dict |
| DB schema tables missing after edit | Tables split into a separate variable never passed to `executescript()` | Put all tables in a single `SCHEMA` string |
| Retry not retrying | `_with_retry(coro)` — coroutine exhausted after first fail | Changed to `_with_retry(factory: Callable[[], Coroutine])` |
| `error_text` overwritten | `_finish_stage(success=False)` overwrote the error set by the stage handler | `_finish_stage` omits the `error_text` column in SQL when the param is None |
| Cancelled stage showed 'failed' | `_execute_stages` called `_finish_stage(success=False)` on cancel | Check `_is_cancelled()`, call `_cancel_stage()` instead |
| export.csv returns 422 | Route registered after `/{job_id}`, FastAPI tries `int("export.csv")` | Move the export route before the parameterized route |
| Old drive names persist after mock rename | Poller upserts by `truenas_disk_id`, old rows stay | Delete `app.db` and restart |
| First row clipped behind sticky thead | `overflow-x: auto` only creates a partial stacking context | Use `overflow: auto` (both axes) on `.table-wrap` |
| `rm data/app.db` permission denied | Container owns the file | Use `docker run --rm -v .../data:/data alpine rm -f /data/app.db` |
| First row clipped after Stage 6b | Stats bar added 70px but max-height not updated | `max-height: calc(100vh - 205px)` |
| SMTP "Connection unexpectedly closed" | `_send_email` used `settings.smtp_port` (587 default) even in SSL mode | Derive port from mode via `_MODE_PORTS` dict; SSL→465, STARTTLS→587, Plain→25 |
| SSL mode missing EHLO | `smtplib.SMTP_SSL` was created without calling `ehlo()` | Added `server.ehlo()` after both SSL and STARTTLS connections |
| `profile` NameError in `_execute_stages` | `_execute_stages` called `_recalculate_progress(job_id, profile)` but `profile` was not in scope | Changed to `_recalculate_progress(job_id)` — the profile param was unused |
| `app_version` Jinja2 global rendered as a function | `templates.env.globals["app_version"] = _get_app_version` set a callable | Set the static string value directly: `= _settings.app_version` |

---

## Feature Reference (Stage 7)

### SSH Burn-In Architecture

`ssh_client.py` provides an optional SSH execution layer. When `SSH_HOST` is set (and a key or password is present), all burn-in stages run real commands over SSH against TrueNAS. When `SSH_HOST` is empty, stages fall back to mock/REST simulation.

**Dual-mode dispatch** — each stage checks `ssh_client.is_configured()`:

```python
if ssh_client.is_configured():
    ...  # run smartctl / badblocks over SSH
else:
    ...  # simulate with the REST API or a timed sleep (mock mode)
```

**SSH client capabilities** (`ssh_client.py`):
- `test_connection()` → `{"ok": bool, "error": str}` — used by the Test SSH button
- `get_smart_attributes(devname)` → parse `smartctl -a`, return `{health, raw_output, attributes, warnings, failures}`
- `start_smart_test(devname, test_type)` → `smartctl -t short|long /dev/{devname}`
- `poll_smart_progress(devname)` → `smartctl -a` during the test; returns `{state, percent_remaining, output}`
- `abort_smart_test(devname)` → `smartctl -X /dev/{devname}`
- `run_badblocks(devname, on_progress, cancelled_fn)` → streams `badblocks -wsv -b 4096 -p 1`; counts bad sectors from stdout (digit-only lines)

**Key auth pattern** — the key is stored as a PEM string in settings, never written to disk:

```python
asyncssh.connect(host, ..., client_keys=[asyncssh.import_private_key(pem_str)], known_hosts=None)
```

**badblocks streaming** — uses `asyncssh.create_process()` with parallel stdout/stderr draining via `asyncio.gather`. Progress updates are written to the DB every 20 lines to avoid excessive writes.
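
The digit-only-line counting can be sketched as a pure function. This is a hypothetical stand-in for the parsing inside `run_badblocks`; it relies only on the behaviour described above, that each bad block is reported on stdout as a line containing just its block number.

```python
def count_bad_blocks(stdout_lines: list[str]) -> int:
    """Count bad sectors from badblocks stdout: a line that is only
    digits is a bad-block number; everything else is progress/noise."""
    return sum(1 for line in stdout_lines if line.strip().isdigit())

# Progress banners and blank lines are ignored; only block numbers count
print(count_bad_blocks(["12345", "Testing with pattern 0xaa", "67890", ""]))  # 2
```

The resulting count is what gets compared against `BAD_BLOCK_THRESHOLD` and stored in `burnin_stages.bad_blocks`.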

### SMART Attribute Monitoring

Monitored attributes and their thresholds:

| ID | Name | Any non-zero → |
|----|------|----------------|
| 5 | Reallocated_Sector_Ct | FAIL |
| 10 | Spin_Retry_Count | WARN |
| 188 | Command_Timeout | WARN |
| 197 | Current_Pending_Sector | FAIL |
| 198 | Offline_Uncorrectable | FAIL |
| 199 | UDMA_CRC_Error_Count | WARN |

SMART attrs are stored as a JSON blob in `drives.smart_attrs`. Updated by the `final_check` stage (SSH mode) or `short_smart`/`long_smart` (REST mode). Displayed in the drive drawer with a colour-coded table + raw `smartctl -a` output.
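
A sketch of how these thresholds could be applied. The helper is hypothetical: the attribute IDs and WARN/FAIL classes come from the table above, but the `{attr_id: raw_value}` dict shape is an assumption about the stored snapshot.

```python
# Fail/warn classes taken from the table above
FAIL_ATTRS = {5, 197, 198}   # reallocated, pending, offline-uncorrectable
WARN_ATTRS = {10, 188, 199}  # spin retry, command timeout, CRC errors

def evaluate_smart_attrs(attrs: dict[int, int]) -> str:
    """Return FAIL if any fail-class attribute is non-zero,
    WARN if any warn-class attribute is non-zero, else OK."""
    if any(attrs.get(a, 0) != 0 for a in FAIL_ATTRS):
        return "FAIL"
    if any(attrs.get(a, 0) != 0 for a in WARN_ATTRS):
        return "WARN"
    return "OK"

print(evaluate_smart_attrs({5: 0, 199: 3}))    # WARN (CRC errors only)
print(evaluate_smart_attrs({197: 1, 199: 3}))  # FAIL (pending sectors win)
```

FAIL takes precedence over WARN, matching the table's severity split.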

### Drive Reset Action

- `POST /api/v1/drives/{drive_id}/reset` — clears `smart_tests` rows to idle, clears `drives.smart_attrs`, writes an audit event, notifies SSE subscribers
- The button appears in the action column when `can_reset` is true: the drive has no active burn-in AND has a non-idle SMART state or stored SMART attrs
- Burn-in history (`burnin_jobs`, `burnin_stages`) is preserved — reset only affects SMART test state

### New Routes (Stage 7)

| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/api/v1/drives/{id}/reset` | Reset SMART state and attrs for a drive |
| `POST` | `/api/v1/settings/test-ssh` | Test SSH connection with current SSH settings |
| `GET` | `/api/v1/updates/check` | Check for latest release from Forgejo (git.hellocomputer.xyz) |

### Check for Updates

The Settings page has a "Check for Updates" button that fetches:

```
GET https://git.hellocomputer.xyz/api/v1/repos/brandon/truenas-burnin/releases/latest
```

It compares the tag name against `settings.app_version` and shows "up to date" or "v{tag} available".
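
The comparison can be sketched as follows. This is a hypothetical helper; a plain string comparison after stripping the leading `v` is assumed, since the doc implies no semver parsing.

```python
def update_status(latest_tag: str, app_version: str) -> str:
    """Compare a release tag like 'v1.0.0-8' against the running version."""
    latest = latest_tag.removeprefix("v")
    if latest == app_version:
        return "up to date"
    return f"v{latest} available"

print(update_status("v1.0.0-7", "1.0.0-7"))  # up to date
print(update_status("v1.0.0-8", "1.0.0-7"))  # v1.0.0-8 available
```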

### Version Badge

`app_version` is set as a Jinja2 global in `renderer.py`:

```python
templates.env.globals["app_version"] = _settings.app_version
```

Displayed in the header as `<span class="header-version">v{app_version}</span>` (right side, muted).

### Configurable Thresholds

`_temp_class` in `renderer.py` now reads from settings instead of hardcoded values:

```python
if temp >= settings.temp_crit_c: return "temp-crit"
if temp >= settings.temp_warn_c: return "temp-warn"
```

The `precheck` stage fails if `temperature_c >= settings.temp_crit_c`.

Surface validate fails if `bad_blocks > settings.bad_block_threshold` (default 0 = any bad sector = fail).

### Cutting Over to Real TrueNAS (Next Steps)

When ready to test against a real TrueNAS CORE box:

1. In Settings (or `.env`), set:
   - **TrueNAS URL** → `https://10.0.0.X` (real IP)
   - **API Key** → real API key
   - **SSH Host** → same IP as TrueNAS
   - **SSH User** → `root` (or a sudoer with smartctl/badblocks access)
   - **SSH Key** → paste the PEM key into the textarea
2. Click **Test SSH Connection** to verify before starting a burn-in
3. TrueNAS CORE uses `ada0`/`da0` device names (not `sda`), so mock drive names will differ
4. Delete `app.db` before the first real poll to clear mock drive rows
5. Comment out the `mock-truenas` service in `docker-compose.yml` (optional — harmless to leave)
6. Verify the TrueNAS CORE v2.0 REST API:
   - `GET /api/v2.0/disk` returns a list with `name`, `serial`, `model`, `size`, `temperature`
   - `GET /api/v2.0/core/get_jobs` with filter `[["method","=","smart.test"]]`
   - `POST /api/v2.0/smart/test` accepts `{disks: [devname], type: "SHORT"|"LONG"}`

---

## Feature Reference (Stage 6b)

### New Pages

| URL | Description |
|-----|-------------|
| `/stats` | Analytics — pass rate by model, daily activity over the last 14 days |
| `/audit` | Audit log — last 200 events with drive/operator context |
| `/settings` | Editable 2-col settings form (SMTP, Notifications, Behavior, Webhook) |
| `/history/{id}/print` | Print-friendly job report with QR code |

### New API Routes (6b + 6c)

| Method | Path | Description |
|--------|------|-------------|
| `PATCH` | `/api/v1/drives/{id}` | Update `notes` and/or `location` |
| `POST` | `/api/v1/settings` | Save runtime settings to `/data/settings_overrides.json` |
| `POST` | `/api/v1/settings/test-smtp` | Test SMTP connection without sending email |

### Notifications

- **Browser push**: Bell icon in the header → `Notification.requestPermission()`. Fires on the `job-alert` SSE event (burn-in pass/fail).
- **SSE alert event**: `job-alert` event type on `/sse/drives`. JS listens via `htmx:sseMessage`.
- **Immediate email**: `send_job_alert()` in `mailer.py`. Triggered by `notifier.notify_job_complete()` from `burnin.py`.
- **Webhook**: `notifier._send_webhook()` — POSTs JSON to `WEBHOOK_URL`. Payload includes event, job_id, devname, serial, model, state, operator, error_text.
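
A sketch of the payload shape a receiver can expect. The builder function is hypothetical and the example values are invented; only the field list comes from the bullet above.

```python
def build_webhook_payload(event: str, job: dict, drive: dict) -> dict:
    """Assemble the JSON body POSTed to WEBHOOK_URL (field list per the doc)."""
    return {
        "event": event,  # "burnin_passed" | "burnin_failed"
        "job_id": job["id"],
        "devname": drive["devname"],
        "serial": drive["serial"],
        "model": drive["model"],
        "state": job["state"],
        "operator": job["operator"],
        "error_text": job.get("error_text"),
    }

payload = build_webhook_payload(
    "burnin_failed",
    {"id": 42, "state": "failed", "operator": "brandon", "error_text": "bad blocks"},
    {"devname": "sdn", "serial": "FAIL001", "model": "MOCK"},
)
```

A flat JSON object like this is what makes the hook usable as-is from ntfy, Slack, Discord, or n8n.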

### Stuck Job Detection

- `burnin.check_stuck_jobs()` runs every 5 poll cycles (~1 min)
- Jobs running longer than `STUCK_JOB_HOURS` (default 24h) → state=unknown
- Logged at CRITICAL level; an audit event is written

### Batch Burn-In

- Checkboxes on each idle/selectable drive row
- A batch bar appears in the filter row when any drives are selected
- Uses the existing `POST /api/v1/burnin/start` with multiple `drive_ids`
- Requires an operator name + explicit confirmation checkbox (no serial required)
- The JS `checkedDriveIds` Set persists across SSE swaps via `restoreCheckboxes()`

### Drive Location

- `location` and `notes` fields added to the drives table via an ALTER TABLE migration
- Inline click-to-edit on the location field in the drive name cell
- Saves via `PATCH /api/v1/drives/{id}` on blur/Enter; restores on Escape

## Feature Reference (Stage 6c)

### Settings Page

- Two-column layout: SMTP card (left, wider) + Notifications / Behavior / Webhook stacked (right)
- Read-only system card at the bottom (TrueNAS URL, poll interval, etc.) — restart-required badge
- All changes save instantly via `POST /api/v1/settings` → `settings_store.save()` → `/data/settings_overrides.json`
- Overrides are loaded on startup in the `main.py` lifespan via `settings_store.init()`
- Connection mode dropdown auto-sets the port: STARTTLS→587, SSL/TLS→465, Plain→25
- Test Connection button at the top of the SMTP card — tests live settings without sending email
- The brand logo in the header is now a clickable `<a href="/">` home link

### SMTP Port Derivation

```python
# mailer.py — the port is derived from the mode, NOT from settings.smtp_port
_MODE_PORTS = {"starttls": 587, "ssl": 465, "plain": 25}
port = _MODE_PORTS.get(mode, 587)
```

Never use `settings.smtp_port` in the mailer — it is kept in config only for `.env` backward compatibility.

### Burn-In Stage Selection

`StartBurninRequest` no longer takes `profile: str`. Instead it takes:
- `run_surface: bool = True` — surface validate (destructive write test)
- `run_short: bool = True` — Short SMART (non-destructive)
- `run_long: bool = True` — Long SMART (non-destructive)

The profile string is computed as a property. Profiles: `full`, `surface_short`, `surface_long`,
`surface`, `short_long`, `short`, `long`. Precheck and final_check always run.
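
The property can be sketched as a plain function. This is hypothetical; the real implementation lives on `StartBurninRequest` in `models.py`, but it must produce the seven profile names listed above.

```python
def compute_profile(run_surface: bool, run_short: bool, run_long: bool) -> str:
    """Derive the profile string from the three stage booleans."""
    if run_surface and run_short and run_long:
        return "full"  # all three stages selected
    parts = []
    if run_surface:
        parts.append("surface")
    if run_short:
        parts.append("short")
    if run_long:
        parts.append("long")
    return "_".join(parts)

print(compute_profile(True, True, True))   # full
print(compute_profile(True, False, True))  # surface_long
print(compute_profile(False, True, True))  # short_long
```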

`STAGE_ORDER` in `burnin.py` has all 7 profile combinations.

`_recalculate_progress()` uses the `_STAGE_BASE_WEIGHTS` dict (per-stage weights) and computes
the overall % dynamically from the actual `burnin_stages` rows — no profile lookup needed.

In the UI, both the single-drive and batch modals show 3 checkboxes. If surface is unchecked:
- The destructive warning is hidden
- The serial confirmation field is hidden (single modal)
- The confirmation checkbox is hidden (batch modal)

### Table Scroll Fix

```css
.table-wrap {
  max-height: calc(100vh - 205px); /* header(44) + main-pad(20) + stats-bar(70) + filter-bar(46) + buffer */
}
```

If the stats bar or other content height changes, update this offset.

## Feature Reference (Stage 6d)

### Cancel Functionality

| What | How |
|------|-----|
| Cancel running Short SMART | `✕ Short` button appears in the action col when `short_busy`; calls `POST /api/v1/drives/{id}/smart/cancel` with `{type:"short"}` |
| Cancel running Long SMART | `✕ Long` button appears when `long_busy`; same route with `{type:"long"}` |
| Cancel individual burn-in | `✕ Burn-In` button (was "Cancel") shown when `bi_active`; calls `POST /api/v1/burnin/{id}/cancel` |
| Cancel All Running | Red `✕ Cancel All Burn-Ins` button appears in the filter bar when any burn-in jobs are active; JS collects all `.btn-cancel[data-job-id]` and cancels each |

**SMART cancel route** (`POST /api/v1/drives/{drive_id}/smart/cancel`):
1. Fetches all running TrueNAS jobs via `client.get_smart_jobs()`
2. Finds the job where `arguments[0].disks` contains the drive's devname
3. Calls `client.abort_job(tn_job_id)`
4. Updates the `smart_tests` table row to `state='aborted'`

### Stage Reordering

- Default order changed to: **Short SMART → Long SMART → Surface Validate** (non-destructive first)
- Drag handles (⠿) on each stage row in both the single and batch modals
- HTML5 drag-and-drop, no external library
- `getStageOrder(listId)` reads the current DOM order of the checked stages
- `stage_order: ["short_smart","long_smart","surface_validate"]` is sent in the API body
- `StartBurninRequest.stage_order: list[str] | None` — validated against allowed stage names
- `burnin.start_job()` accepts a `stage_order` param and builds: `["precheck"] + stage_order + ["final_check"]`
- `_run_job()` reads stage names back from `burnin_stages ORDER BY id`, so the custom order is honoured
- The destructive warning / serial confirmation is still triggered by the `stage-surface` checkbox ID (order-independent)

## NPM / DNS Setup

- Proxy host: `burnin.hellocomputer.xyz` → `http://10.0.0.138:8080`
- Authelia protection: recommended (no built-in auth in the app)
- DNS: `burnin.hellocomputer.xyz` CNAME → `sandon.hellocomputer.xyz` (proxied: false)