docs: update CLAUDE.md for Stage 7; bump version to 1.0.0-7
Documents all Stage 7 features: SSH burn-in architecture, SMART attr monitoring, drive reset, version badge, stats polish, new env vars, new API routes, and real-TrueNAS cutover steps.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
parent 2dff58bd52
commit fc33c0d11e
2 changed files with 124 additions and 18 deletions
140
CLAUDE.md
@@ -1,7 +1,7 @@
 # TrueNAS Burn-In Dashboard — Project Context
 
 > Drop this file in any new Claude session to resume work with full context.
-> Last updated: 2026-02-22 (Stage 6d)
+> Last updated: 2026-02-24 (Stage 7)
 
 ---
 
@@ -28,7 +28,7 @@ against a TrueNAS CORE instance. Deployed on **maple.local** (10.0.0.138).
 | 6b | UX overhaul (stats bar, alerts, batch, notifications, location, print, analytics) | ✅ |
 | 6c | Settings overhaul (editable form, runtime store, SMTP fix, stage selection) | ✅ |
 | 6d | Cancel SMART tests, Cancel All burn-ins, drag-to-reorder stages in modals | ✅ |
-| 7 | Cut to real TrueNAS | 🔲 future |
+| 7 | SSH burn-in execution, SMART attr monitoring, drive reset, version badge, stats polish | ✅ |
 
 ---
 
@@ -52,6 +52,7 @@ truenas-burnin/
 ├── database.py # schema, migrations, init_db(), get_db()
 ├── models.py # Pydantic v2 models; StartBurninRequest has run_surface/run_short/run_long + profile property
 ├── settings_store.py # runtime settings store — persists to /data/settings_overrides.json
+├── ssh_client.py # asyncssh client: smartctl parsing, badblocks streaming, test_connection
 ├── truenas.py # httpx async client with retry (lambda factory pattern)
 ├── poller.py # poll loop, SSE pub/sub, stale detection, stuck-job check
 ├── burnin.py # orchestrator, semaphore, stages, check_stuck_jobs()
@@ -72,8 +73,8 @@ truenas-burnin/
 ├── history.html
 ├── job_detail.html # + Print/Export button
 ├── audit.html # audit event log
-├── stats.html # analytics: pass rate by model, daily activity
-├── settings.html # editable 2-col form: SMTP (left) + Notifications/Behavior/Webhook (right)
+├── stats.html # analytics: pass rate by model, daily activity, duration by size, failures by stage
+├── settings.html # editable 2-col form: SMTP + SSH (left) + Notifications/Behavior/Webhook/System (right)
 ├── job_print.html # print view with client-side QR code (qrcodejs CDN)
 └── components/
 ├── drives_table.html # checkboxes, elapsed time, location inline edit
@@ -129,10 +130,19 @@ burnin_jobs (id, drive_id FK, profile, state CHECK(queued/running/passed/
 
 -- burnin_stages: one row per stage per job
 burnin_stages (id, burnin_job_id FK, stage_name, state, percent,
-started_at, finished_at, error_text)
+started_at, finished_at, error_text,
+log_text TEXT, -- raw smartctl/badblocks SSH output
+bad_blocks INTEGER) -- bad sector count from surface_validate
 
 -- audit_events: append-only log
 audit_events (id, event_type, drive_id, job_id, operator, note, created_at)
+
+-- drives columns added by migrations:
+-- location TEXT, notes TEXT (Stage 6b)
+-- smart_attrs TEXT -- JSON blob of last SMART attribute snapshot (Stage 7)
+
+-- smart_tests columns added by migrations:
+-- raw_output TEXT -- raw smartctl -a output (Stage 7)
 ```
 
 ---
 
@@ -194,6 +204,15 @@ All read from `.env` via `pydantic-settings`. See `.env.example` for full list.
 | `SMTP_ALERT_ON_FAIL` | `true` | Immediate email when a job fails |
 | `SMTP_ALERT_ON_PASS` | `false` | Immediate email when a job passes |
 | `WEBHOOK_URL` | `` | POST JSON on burnin_passed/burnin_failed. Works with ntfy, Slack, Discord, n8n |
+| `TEMP_WARN_C` | `46` | Temperature warning threshold (°C) |
+| `TEMP_CRIT_C` | `55` | Temperature critical threshold — precheck fails above this |
+| `BAD_BLOCK_THRESHOLD` | `0` | Max bad blocks allowed before surface_validate fails (0 = any bad = fail) |
+| `APP_VERSION` | `1.0.0-7` | Displayed in header version badge |
+| `SSH_HOST` | `` | TrueNAS SSH hostname/IP — empty disables SSH mode (uses mock/REST) |
+| `SSH_PORT` | `22` | TrueNAS SSH port |
+| `SSH_USER` | `root` | TrueNAS SSH username |
+| `SSH_PASSWORD` | `` | TrueNAS SSH password (use key instead for production) |
+| `SSH_KEY` | `` | TrueNAS SSH private key PEM string — loaded in-memory, never written to disk |
 
 ---
 
@@ -305,27 +324,114 @@ async def burnin_get(job_id: int, ...): ...
 | First row clipped after Stage 6b | Stats bar added 70px but max-height not updated | `max-height: calc(100vh - 205px)` |
 | SMTP "Connection unexpectedly closed" | `_send_email` used `settings.smtp_port` (587 default) even in SSL mode | Derive port from mode via `_MODE_PORTS` dict; SSL→465, STARTTLS→587, Plain→25 |
 | SSL mode missing EHLO | `smtplib.SMTP_SSL` was created without calling `ehlo()` | Added `server.ehlo()` after both SSL and STARTTLS connections |
+| `profile` NameError in `_execute_stages` | `_execute_stages` called `_recalculate_progress(job_id, profile)` but `profile` not in scope | Changed to `_recalculate_progress(job_id)` — profile param was unused |
+| `app_version` Jinja2 global rendered as function | Set `templates.env.globals["app_version"] = _get_app_version` (callable) | Set to the static string value directly: `= _settings.app_version` |
 
 ---
 
-## Stage 7 — Cutting to Real TrueNAS (TODO)
+## Feature Reference (Stage 7)
 
+### SSH Burn-In Architecture
+
+`ssh_client.py` provides an optional SSH execution layer. When `SSH_HOST` is set (and key or password is present), all burn-in stages run real commands over SSH against TrueNAS. When `SSH_HOST` is empty, stages fall back to mock/REST simulation.
+
+**Dual-mode dispatch** — each stage checks `ssh_client.is_configured()`:
+```python
+if ssh_client.is_configured():
+    # run smartctl / badblocks over SSH
+else:
+    # simulate with REST API or timed sleep (mock mode)
+```
+
+**SSH client capabilities** (`ssh_client.py`):
+- `test_connection()` → `{"ok": bool, "error": str}` — used by Test SSH button
+- `get_smart_attributes(devname)` → parse `smartctl -a`, return `{health, raw_output, attributes, warnings, failures}`
+- `start_smart_test(devname, test_type)` → `smartctl -t short|long /dev/{devname}`
+- `poll_smart_progress(devname)` → `smartctl -a` during test; returns `{state, percent_remaining, output}`
+- `abort_smart_test(devname)` → `smartctl -X /dev/{devname}`
+- `run_badblocks(devname, on_progress, cancelled_fn)` → streams `badblocks -wsv -b 4096 -p 1`; counts bad sectors from stdout (digit-only lines)
+
+**Key auth pattern** — key is stored as PEM string in settings, never written to disk:
+```python
+asyncssh.connect(host, ..., client_keys=[asyncssh.import_private_key(pem_str)], known_hosts=None)
+```
+
+**badblocks streaming** — uses `asyncssh.create_process()` with parallel stdout/stderr draining via `asyncio.gather`. Progress updates written to DB every 20 lines to avoid excessive writes.
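
Reviewer sketch (not part of the commit): the line handling described for `run_badblocks` could look roughly like the following. It treats a digit-only line as one bad block and fires a progress callback once every 20 lines. The names `BadblocksTally`, `feed`, and `PROGRESS_EVERY` are hypothetical, the sample lines are made up, and the asyncssh transport is left out so the logic runs on plain strings.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

PROGRESS_EVERY = 20  # assumed name; the source only says "every 20 lines"


@dataclass
class BadblocksTally:
    """Accumulates badblocks output one line at a time (hypothetical helper)."""

    on_progress: Callable[[int, int], None]  # receives (lines_seen, bad_blocks)
    lines_seen: int = 0
    bad_blocks: int = 0

    def feed(self, line: str) -> None:
        text = line.strip()
        self.lines_seen += 1
        # A line made up only of digits is counted as one bad block number.
        if text.isdigit():
            self.bad_blocks += 1
        # Throttle callbacks so the caller does not write to the DB on every line.
        if self.lines_seen % PROGRESS_EVERY == 0:
            self.on_progress(self.lines_seen, self.bad_blocks)


updates: List[Tuple[int, int]] = []
tally = BadblocksTally(on_progress=lambda seen, bad: updates.append((seen, bad)))

# Made-up sample: one status line, two block numbers, then 37 filler lines.
sample = ["Testing with pattern 0xaa:", "12345", "67890"] + ["done"] * 37
for raw in sample:
    tally.feed(raw)
```

With 40 input lines the callback fires twice, which is the throttling behaviour the description asks for. A real implementation would call `feed` from the coroutine draining stdout.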
+
+### SMART Attribute Monitoring
+
+Monitored attributes and their thresholds:
+
+| ID | Name | Any non-zero → |
+|----|------|----------------|
+| 5 | Reallocated_Sector_Ct | FAIL |
+| 10 | Spin_Retry_Count | WARN |
+| 188 | Command_Timeout | WARN |
+| 197 | Current_Pending_Sector | FAIL |
+| 198 | Offline_Uncorrectable | FAIL |
+| 199 | UDMA_CRC_Error_Count | WARN |
+
+SMART attrs stored as JSON blob in `drives.smart_attrs`. Updated by `final_check` stage (SSH mode) or `short_smart`/`long_smart` REST mode. Displayed in drive drawer with colour-coded table + raw `smartctl -a` output.
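
Reviewer sketch (not part of the commit): the attribute table can be read as a lookup from attribute ID to severity, applied whenever the raw value is non-zero. The function name `classify` and the input shape (a dict of attribute ID to raw value) are assumptions made for illustration; the real parser in `ssh_client.py` may structure this differently.

```python
from typing import Dict, List, Tuple

# Attribute IDs, names, and severities copied from the table in this diff.
MONITORED: Dict[int, Tuple[str, str]] = {
    5: ("Reallocated_Sector_Ct", "FAIL"),
    10: ("Spin_Retry_Count", "WARN"),
    188: ("Command_Timeout", "WARN"),
    197: ("Current_Pending_Sector", "FAIL"),
    198: ("Offline_Uncorrectable", "FAIL"),
    199: ("UDMA_CRC_Error_Count", "WARN"),
}


def classify(raw_values: Dict[int, int]) -> Tuple[List[str], List[str]]:
    """Return (warnings, failures) for monitored attributes with a non-zero raw value."""
    warnings: List[str] = []
    failures: List[str] = []
    for attr_id, (name, severity) in MONITORED.items():
        if raw_values.get(attr_id, 0) != 0:
            (failures if severity == "FAIL" else warnings).append(name)
    return warnings, failures


# Attribute 9 (power-on hours) is not monitored, so it is ignored here.
warnings, failures = classify({5: 0, 197: 8, 199: 3, 9: 41000})
```

Keeping the table as data means a new monitored attribute is a one-line change rather than another branch.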
+
+### Drive Reset Action
+
+- `POST /api/v1/drives/{drive_id}/reset` — clears `smart_tests` rows to idle, clears `drives.smart_attrs`, writes audit event, notifies SSE subscribers
+- Button appears in action column when `can_reset` = drive has no active burn-in AND has any non-idle smart state or smart attrs
+- Burn-in history (burnin_jobs, burnin_stages) is preserved — reset only affects SMART test state
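
Reviewer sketch (not part of the commit): the `can_reset` rule above is a small boolean predicate. The parameter names below are hypothetical, as is the assumption that SMART test state is a string with `"idle"` as its resting value.

```python
from typing import Optional


def can_reset(
    has_active_burnin: bool,
    short_state: str,
    long_state: str,
    smart_attrs: Optional[str],
) -> bool:
    """True when no burn-in is active and there is SMART state or attrs to clear."""
    has_smart_state = short_state != "idle" or long_state != "idle"
    return (not has_active_burnin) and (has_smart_state or bool(smart_attrs))
```

For example, a drive with a passed short test and no running burn-in would show the button, while a fully idle drive with no stored attributes would not.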
+
+### New Routes (Stage 7)
+
+| Method | Path | Description |
+|--------|------|-------------|
+| `POST` | `/api/v1/drives/{id}/reset` | Reset SMART state and attrs for a drive |
+| `POST` | `/api/v1/settings/test-ssh` | Test SSH connection with current SSH settings |
+| `GET` | `/api/v1/updates/check` | Check for latest release from Forgejo git.hellocomputer.xyz |
+
+### Check for Updates
+
+Settings page has a "Check for Updates" button that fetches:
+```
+GET https://git.hellocomputer.xyz/api/v1/repos/brandon/truenas-burnin/releases/latest
+```
+Compares tag name against `settings.app_version`; shows "up to date" or "v{tag} available".
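
Reviewer sketch (not part of the commit): the comparison step could be as simple as the following. The diff does not say whether the check is plain equality or an ordered version comparison, so this assumes equality after stripping an optional leading `v`; `update_message` is a hypothetical name.

```python
def update_message(latest_tag: str, app_version: str) -> str:
    """Compare a release tag with the running version (equality only, by assumption)."""
    tag = latest_tag.strip().lstrip("vV")
    current = app_version.strip().lstrip("vV")
    if tag == current:
        return "up to date"
    return f"v{tag} available"
```

Equality avoids having to define an ordering for suffixes such as `-6d` and `-7`, at the cost of reporting an older tag as "available".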
+
+### Version Badge
+
+`app_version` set as Jinja2 global in `renderer.py`:
+```python
+templates.env.globals["app_version"] = _settings.app_version
+```
+Displayed in header as `<span class="header-version">v{app_version}</span>` (right side, muted).
+
+### Configurable Thresholds
+
+`renderer.py` `_temp_class` now reads from settings instead of hardcoded values:
+```python
+if temp >= settings.temp_crit_c: return "temp-crit"
+if temp >= settings.temp_warn_c: return "temp-warn"
+```
+`precheck` stage fails if `temperature_c >= settings.temp_crit_c`.
+
+Surface validate fails if `bad_blocks > settings.bad_block_threshold` (default 0 = any bad sector = fail).
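
Reviewer sketch (not part of the commit): the three threshold rules above, gathered into one self-contained example. `Thresholds` stands in for the real settings object, and the empty-string return for a normal or missing temperature is an assumption. Note that the env var table says precheck fails "above" the critical value, while the rule quoted here uses `>=`; the sketch follows the `>=` form.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Thresholds:
    """Stand-in for the settings object; defaults match the env var table."""

    temp_warn_c: int = 46
    temp_crit_c: int = 55
    bad_block_threshold: int = 0


def temp_class(temp: Optional[int], t: Thresholds) -> str:
    if temp is None:
        return ""  # assumption: no reading means no CSS class
    if temp >= t.temp_crit_c:
        return "temp-crit"
    if temp >= t.temp_warn_c:
        return "temp-warn"
    return ""  # assumption: normal temperatures get no class


def precheck_passes(temperature_c: int, t: Thresholds) -> bool:
    # Fails at or above the critical threshold.
    return temperature_c < t.temp_crit_c


def surface_passes(bad_blocks: int, t: Thresholds) -> bool:
    # With the default threshold of 0, a single bad block fails the stage.
    return bad_blocks <= t.bad_block_threshold


defaults = Thresholds()
```

Checking the critical threshold first matters: a 60 °C reading also satisfies the warning comparison, so the order of the two `if` statements decides which class wins.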
+
+### Cutting to Real TrueNAS (Next Steps)
+
 When ready to test against a real TrueNAS CORE box:
 
-1. In `.env` on maple.local, set:
-```env
-TRUENAS_BASE_URL=https://10.0.0.203 # or whatever your TrueNAS IP is
-TRUENAS_API_KEY=your-real-key-here
-TRUENAS_VERIFY_TLS=false # unless you have a valid cert
-```
-2. Comment out `mock-truenas` service in `docker-compose.yml` (or leave it running — harmless)
-3. Verify TrueNAS CORE v2.0 API contract matches what `truenas.py` expects:
+1. In Settings (or `.env`), set:
+- **TrueNAS URL** → `https://10.0.0.X` (real IP)
+- **API Key** → real API key
+- **SSH Host** → same IP as TrueNAS
+- **SSH User** → `root` (or sudoer with smartctl/badblocks access)
+- **SSH Key** → paste PEM key into textarea
+2. Click **Test SSH Connection** to verify before starting a burn-in
+3. TrueNAS CORE uses `ada0`, `da0` device names (not `sda`). Mock drive names will differ.
+4. Delete `app.db` before first real poll to clear mock drive rows
+5. Comment out `mock-truenas` service in `docker-compose.yml` (optional — harmless to leave)
+6. Verify TrueNAS CORE v2.0 REST API:
 - `GET /api/v2.0/disk` returns list with `name`, `serial`, `model`, `size`, `temperature`
 - `GET /api/v2.0/core/get_jobs` with filter `[["method","=","smart.test"]]`
 - `POST /api/v2.0/smart/test` accepts `{disks: [devname], type: "SHORT"|"LONG"}`
-4. Check that disk names match expected format (TrueNAS CORE uses `ada0`, `da0`, etc. — not `sda`)
-- You may need to update mock drive names back or adjust poller logic
-5. Delete `app.db` to clear mock drive rows before first real poll
 
 ---
 
@@ -68,7 +68,7 @@ class Settings(BaseSettings):
     ssh_key: str = "" # PEM private key content (paste full key including headers)
 
     # Application version — used by the /api/v1/updates/check endpoint
-    app_version: str = "1.0.0-6d"
+    app_version: str = "1.0.0-7"
 
 
 settings = Settings()