# NAS Burn-In — Project Specification

**Version:** 1.0.0-39
**Status:** Active Development
**Audience:** Public / Open Source

---

## Overview

NAS Burn-In is a self-hosted web dashboard that runs on a separate machine or VM and connects to a TrueNAS system via SSH to automate and monitor the drive burn-in process. It is designed for users who want to validate new hard drives before adding them to a ZFS pool — where reliability is non-negotiable.

The app is not a TrueNAS plugin and does not run on TrueNAS itself. It connects remotely over SSH to issue `smartctl` and `badblocks` commands, polls results, and presents everything through a dark-themed real-time dashboard. It is deployed via Docker Compose and configured through a Settings UI and `.env` file.

---

## Core Philosophy

- Drives going into a ZFS pool must be rock solid. The app's job is to surface any doubt about a drive before it earns a slot in the pool.
- Burn-in is always triggered manually. There is no scheduling or automation.
- Simplicity over features. The README and Settings UI should be sufficient for any technically capable user to be up and running without hand-holding.
- Recommend safe defaults. Warn loudly when users push limits (too many parallel jobs, destructive operations, high temperatures).

---

## Test Sequence

Every drive goes through the following stages in order. A failure at any stage stops that drive immediately.

### Stage 1 — Short SMART Test
```
smartctl -t short -d sat /dev/sdX
```
Polls for completion via:
```
smartctl -a -d sat /dev/sdX | grep -i remaining
```
Expected duration: ~2 minutes. If the test fails or reports any critical attribute violation, the drive is marked FAILED and no further tests run.
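
The polling step above can be sketched as a small parser. This is a minimal sketch, assuming smartctl reports progress on a line such as `90% of test remaining.` (the text the `grep -i remaining` filter is looking for). The function name `parse_remaining_percent` is hypothetical and not taken from the app.

```python
import re
from typing import Optional

# Assumed progress line format: "... 90% of test remaining."
_REMAINING = re.compile(r"(\d{1,3})%\s+of\s+test\s+remaining", re.IGNORECASE)


def parse_remaining_percent(smartctl_output: str) -> Optional[int]:
    """Return the percentage of the self-test still remaining, or None if absent."""
    match = _REMAINING.search(smartctl_output)
    return int(match.group(1)) if match else None


# Illustrative sample text, not captured from a real drive.
sample = (
    "Self-test execution status:      ( 249) Self-test routine in progress...\n"
    "                                        90% of test remaining.\n"
)
remaining = parse_remaining_percent(sample)
progress = None if remaining is None else 100 - remaining
```

A return value of `None` only means that no progress line was found. It does not say whether the test finished or failed, so the self-test result would still need to be checked separately.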

### Stage 2 — Long SMART Test
```
smartctl -t long -d sat /dev/sdX
```
Expected duration: varies by drive size (typically 3–6 hours for 8–12 TB drives). Polling and failure behavior are the same as in Stage 1.

### Stage 3 — Surface Scan (Badblocks, Destructive)
```
badblocks -wsv -b 4096 -p 1 /dev/sdX
```
This is a **destructive write test**. The UI must display a prominent warning before this stage begins, and again in the Settings page where the behavior is documented. The `-w` flag overwrites all data on the drive. This is intentional — these are new drives being validated before pool use.

**Failure threshold:** By default, any bad block found triggers an immediate abort and FAILED status. The threshold is configurable in Settings (`Bad Block Threshold`, default: 0 — meaning any bad sector = fail).
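
The threshold check can be illustrated with a short sketch. It assumes the summary line that `badblocks -v` prints at the end of a pass has the form `Pass completed, N bad blocks found. (...)`, and that a drive fails when its count is strictly greater than the configured threshold (which matches the default of 0). Both function names are hypothetical, and the sketch covers only the end-of-pass summary, not the immediate abort during a run.

```python
import re
from typing import Optional

# Assumed summary line printed by `badblocks -v`, e.g.
# "Pass completed, 2 bad blocks found. (2/0/0 errors)"
_SUMMARY = re.compile(r"Pass completed, (\d+) bad blocks? found")


def count_bad_blocks(badblocks_output: str) -> Optional[int]:
    """Return the bad block count from the summary line, or None if it is missing."""
    match = _SUMMARY.search(badblocks_output)
    return int(match.group(1)) if match else None


def surface_scan_failed(bad_blocks: int, threshold: int = 0) -> bool:
    """Assumption: a drive fails once its count exceeds the configured threshold."""
    return bad_blocks > threshold


summary = "Pass completed, 2 bad blocks found. (2/0/0 errors)"
found = count_bad_blocks(summary)
```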

---

## SMART Attributes to Monitor

The following attributes are checked after each SMART test and continuously during the burn-in run. Any non-zero raw value on a monitored attribute is flagged: depending on the attribute, it either raises a warning or fails the drive, as listed in the Threshold column.

| ID  | Attribute              | Threshold  | Notes                                     |
|-----|------------------------|------------|-------------------------------------------|
| 5   | Reallocated_Sector_Ct  | > 0 = FAIL | Any reallocation is disqualifying for ZFS |
| 10  | Spin_Retry_Count       | > 0 = WARN | Mechanical stress indicator               |
| 188 | Command_Timeout        | > 0 = WARN | Drive not responding to commands          |
| 197 | Current_Pending_Sector | > 0 = FAIL | Sectors waiting to be reallocated         |
| 198 | Offline_Uncorrectable  | > 0 = FAIL | Unrecoverable read errors                 |
| 199 | UDMA_CRC_Error_Count   | > 0 = WARN | Likely cable/controller, flag for review  |
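
As an illustration of how the table could be applied, the sketch below maps each attribute ID to the severity from the Threshold column and returns the worst result. The names `RULES` and `classify` are hypothetical, and the input is assumed to be a mapping from attribute ID to raw value that has already been parsed from smartctl output.

```python
from typing import Dict

# Severity applied when the raw value is above zero, taken from the table above.
RULES: Dict[int, str] = {
    5: "FAIL",    # Reallocated_Sector_Ct
    10: "WARN",   # Spin_Retry_Count
    188: "WARN",  # Command_Timeout
    197: "FAIL",  # Current_Pending_Sector
    198: "FAIL",  # Offline_Uncorrectable
    199: "WARN",  # UDMA_CRC_Error_Count
}
_ORDER = {"PASS": 0, "WARN": 1, "FAIL": 2}


def classify(raw_values: Dict[int, int]) -> str:
    """Return the worst severity across all monitored attributes."""
    verdict = "PASS"
    for attr_id, severity in RULES.items():
        if raw_values.get(attr_id, 0) > 0 and _ORDER[severity] > _ORDER[verdict]:
            verdict = severity
    return verdict


# A CRC error alone is a warning; a pending sector turns the result into a failure.
crc_only = classify({5: 0, 199: 3})
with_pending = classify({199: 3, 197: 1})
```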

---

## Failure Behavior

When a drive fails at any stage:

1. All remaining tests for that drive are immediately cancelled.
2. The drive is marked `FAILED` in the UI with the specific failure reason (e.g., `FAILED (SURFACE VALIDATE)`, `FAILED (REALLOCATED SECTORS)`).
3. An alert is fired immediately via whichever notification channels are enabled in Settings (email and/or webhook — both can fire simultaneously).
4. The failed drive's row is visually distinct in the dashboard and cannot be accidentally re-queued without an explicit reset action.

A **Reset** action clears the test state for a drive so it can be re-queued. It does not cancel in-progress tests — the Cancel button does that. Reset is only available on completed drives (passed, failed, or interrupted).

---

## UI

### Dashboard (Main View)

- **Stats bar:** Total drives, Running, Failed, Passed, Idle counts. When SSH is active, also shows CPU and PCH temperature chips (live via SSE) and a thermal pressure indicator (WARM/HOT) that appears when running drives exceed the warning threshold.
- **Filter chips:** All / Running / Failed / Passed / Idle — filters the table below.
- **Drive table columns:** Drive (device name + model), Serial, Size, Temp, Health, Short SMART, Long SMART, Burn-In, Actions.
- **Temperature display:** Color-coded. Green ≤ 45°C, Yellow 46–54°C, Red ≥ 55°C. Thresholds configurable in Settings.
- **Running tests:** Show an animated progress bar with percentage and elapsed time instead of a static badge.
- **Actions per drive:** Short, Long, Burn-In buttons. Cancel button replaces Start when a test is running.
- **Row click:** Opens the Log Drawer for that drive.
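
The temperature color bands in the list above reduce to two comparisons. A minimal sketch, assuming the default warning and critical thresholds of 46°C and 55°C; the function name `temperature_color` is hypothetical.

```python
def temperature_color(temp_c: float, warn_c: float = 46, crit_c: float = 55) -> str:
    """Map a drive temperature to the dashboard color band."""
    if temp_c >= crit_c:
        return "red"
    if temp_c >= warn_c:
        return "yellow"
    return "green"


# Boundary values from the spec: 45 is green, 46 to 54 is yellow, 55 and up is red.
bands = [temperature_color(t) for t in (45, 46, 54, 55)]
```

Passing the thresholds as parameters keeps the function in step with whatever values are configured in Settings.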

### Log Drawer

Slides up from the bottom of the page when a drive row is clicked. Does not navigate away — the table remains visible and scrollable above.

Three tabs:
- **Burn-In** — stage-by-stage progress for the latest burn-in job; shows live elapsed time, raw SSH log output (smartctl / badblocks), and bad block count.
- **SMART** — output of the last smartctl run for this drive, with monitored attribute values highlighted (green/yellow/red). Raw `smartctl -a` output also shown when SSH mode is active.
- **Events** — chronological timeline of everything that happened to this drive (test started, test passed, failure detected, alert sent, reset, etc.).

Features:
- Auto-scroll toggle (on by default).
- Blinking cursor on the active output line of a running test.
- Close button or click the same row again to dismiss.
- Failed drives show error lines in red with exact bad block sector numbers.

### History Page

Per-drive history. Each drive (identified by serial number) has a log of every burn-in run ever performed, with timestamps, results, and duration. Not per-session — per individual drive.

### Audit Page

Application-level event log. Records: test started, test cancelled, settings changed, alert sent, container restarted, SSH connection lost/restored. Useful for debugging and for open source users troubleshooting their setup.

### Stats Page

Aggregate statistics across all drives and all time. Pass rate, average test duration by drive size, failure breakdown by failure type.

### Settings Page

Divided into sections:

**EMAIL (SMTP)**
- Host, Mode (STARTTLS/SSL/plain), Port, Timeout, Username, Password, From, To.
- Test Connection button.
- Enable/disable toggle.

**WEBHOOK**
- Single URL field. POST JSON payload on `burnin_passed` and `burnin_failed` events.
- Compatible with ntfy.sh, Slack, Discord, n8n, and any generic HTTP POST receiver.
- Leave blank to disable.

**NOTIFICATIONS**
- Daily Report toggle (sends full drive status email at a configured hour).
- Alert on Failure toggle (immediate — fires both email and webhook if both configured).
- Alert on Pass toggle.

**BURN-IN BEHAVIOR**
- Max Parallel Burn-Ins (default: 2, max: 60).
- Warning displayed inline when set above 8: "Running many simultaneous surface scans may saturate your storage controller and produce unreliable results. Recommended: 2–4."
- Bad block failure threshold (default: 0 — any bad sector = fail).
- Stuck job threshold in hours (default: 24 — jobs running longer than this are auto-marked Unknown).
- **Adaptive thermal gate:** When drive temperatures are at or above the warning threshold, new burn-in jobs wait up to 3 minutes before acquiring a semaphore slot. This reduces thermal pile-up when drives are already running hot.
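
The adaptive thermal gate described in the list above can be sketched with an `asyncio.Semaphore`. This is a rough sketch under stated assumptions, not the app's implementation: the `hottest_temp_c` callable, the re-check interval, and the function name `acquire_slot` are all hypothetical. Only the warning threshold and the 3-minute cap come from this section.

```python
import asyncio
from typing import Callable


async def acquire_slot(
    slots: asyncio.Semaphore,
    hottest_temp_c: Callable[[], float],  # hypothetical: returns the hottest drive temp
    warn_c: float = 46.0,
    max_wait_s: float = 180.0,            # "up to 3 minutes"
    check_every_s: float = 5.0,           # assumed re-check interval
) -> float:
    """Hold back while drives run hot, then take a slot. Returns seconds waited."""
    waited = 0.0
    while hottest_temp_c() >= warn_c and waited < max_wait_s:
        await asyncio.sleep(check_every_s)
        waited += check_every_s
    await slots.acquire()
    return waited


async def demo() -> float:
    readings = iter([50.0, 48.0, 44.0])   # drives cool down over three checks
    slots = asyncio.Semaphore(2)          # Max Parallel Burn-Ins default
    return await acquire_slot(slots, lambda: next(readings), check_every_s=0.01)


waited = asyncio.run(demo())
```

Note that the gate only delays the job. Once the cap is reached the job proceeds even if the drives are still warm, which matches the "wait up to" wording above.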

**TEMPERATURE**
- Warning threshold (default: 46°C).
- Critical threshold (default: 55°C).

**SSH**
- TrueNAS host/IP.
- Port (default: 22).
- Username.
- Authentication: SSH key (paste or upload) or password.
- Test Connection button.

**SYSTEM** *(restart required to change — set in .env)*
- TrueNAS API URL.
- Verify TLS toggle.
- Poll interval (default: 12s).
- Stale threshold (default: 45s).
- IP allowlist.
- Log level (DEBUG / INFO / WARN / ERROR).

**VERSION & UPDATES**
- Displays current version.
- "Check for Updates" button — queries Forgejo releases API at `git.hellocomputer.xyz` and shows latest version if an update is available.

---

## Data Persistence

**SQLite** — single file, zero config, atomic writes. No data loss on container restart.

On restart, any drive that was in a `running` state is automatically transitioned to `interrupted`. The user sees "INTERRUPTED" in the burn-in column and must manually reset and re-queue the drive. The partial log up to the point of interruption is preserved and viewable in the Log Drawer.

Drive location labels persist in SQLite tied to serial number, so a drive's label survives container restarts and reappears automatically when the drive is detected again.
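
The restart transition could look roughly like the following. The table name `burnin_jobs`, its columns, and the function name `mark_interrupted` are assumptions made for illustration; the actual schema is not described in this document, and the second serial number is made-up sample data.

```python
import sqlite3


def mark_interrupted(conn: sqlite3.Connection) -> int:
    """Flip every job left in 'running' to 'interrupted'; return how many changed."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE burnin_jobs SET state = 'interrupted' WHERE state = 'running'"
        )
    return cur.rowcount


# In-memory database standing in for the app's SQLite file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE burnin_jobs (serial TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO burnin_jobs VALUES (?, ?)",
    [("WDZ1A002", "running"), ("WDZ1A003", "passed")],
)
changed = mark_interrupted(conn)
states = dict(conn.execute("SELECT serial, state FROM burnin_jobs"))
```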

---

## Notifications

### Email
Standard SMTP. Fires on: burn-in failure (immediate), burn-in pass (if enabled), daily report (scheduled).

Failure email includes: drive name, serial number, size, failure stage, failure reason, bad block count (if applicable), SMART attribute snapshot, timestamp.

### Webhook
Single HTTP POST to configured URL with JSON body:
```json
{
  "event": "burnin_failed",
  "drive": "sda",
  "serial": "WDZ1A002",
  "size": "12 TB",
  "failure_reason": "SURFACE VALIDATE",
  "bad_blocks": 2,
  "timestamp": "2025-01-15T03:21:04Z"
}
```
Compatible with ntfy.sh, Slack incoming webhooks, Discord webhooks, n8n HTTP trigger nodes.

Both email and webhook fire simultaneously when both are configured and enabled. User controls each independently via Settings toggles.
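
For receivers that need to be tested against this payload, the request can be reproduced with the standard library. A minimal sketch: the function names, the `Content-Type` header, and the example URL are assumptions, and the request is only built here, not sent.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_failure_payload(drive: str, serial: str, size: str,
                          failure_reason: str, bad_blocks: int,
                          when: datetime) -> bytes:
    """Serialize a burnin_failed event using the field names shown above."""
    body = {
        "event": "burnin_failed",
        "drive": drive,
        "serial": serial,
        "size": size,
        "failure_reason": failure_reason,
        "bad_blocks": bad_blocks,
        "timestamp": when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    return json.dumps(body).encode("utf-8")


def build_request(url: str, payload: bytes) -> urllib.request.Request:
    """Prepare the POST; the Content-Type header is an assumption."""
    return urllib.request.Request(
        url, data=payload, method="POST",
        headers={"Content-Type": "application/json"},
    )


payload = build_failure_payload(
    "sda", "WDZ1A002", "12 TB", "SURFACE VALIDATE", 2,
    datetime(2025, 1, 15, 3, 21, 4, tzinfo=timezone.utc),
)
request = build_request("https://example.invalid/hook", payload)  # placeholder URL
# urllib.request.urlopen(request, timeout=10) would send it; not called here.
```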

---

## SSH Architecture

The app connects to TrueNAS over SSH from the host running the Docker container. It does not use the TrueNAS web API for SMART or badblocks operations — all commands are issued directly over SSH using `asyncssh`.

This is required for TrueNAS SCALE 25.10 (Goldeye), which removed the `POST /api/v2.0/smart/test` REST endpoint. SSH is also the only way to run `badblocks`. The TrueNAS REST API is still used for drive discovery (`GET /api/v2.0/disk`) and temperature polling (`POST /api/v2.0/disk/temperatures`).

Connection details are configured in Settings (not `.env`). Supports:
- Password authentication.
- SSH key authentication — key pasted into Settings UI or mounted as a Docker volume at `/run/secrets/ssh_key` (recommended for production).
- Custom port (default: 22).
- Test Connection button validates credentials before saving.

In addition to burn-in commands, the SSH connection is used to:
- Run `sensors -j` (lm-sensors) each poll cycle to read CPU and PCH/chipset temperatures, displayed live in the dashboard stats bar.
- Poll `smartctl -a` progress during standalone SMART tests.

On SSH disconnection mid-test: the app marks the drive as `interrupted`. The remote process may or may not still be running. The user must reset the drive and re-queue.
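
The `sensors -j` step can be illustrated with a small parser. The sketch assumes the usual lm-sensors JSON layout (chip name, then feature, then `tempN_input` readings) and picks chips by the `coretemp` and `pch_` prefixes. The function names are hypothetical, the sample JSON is illustrative rather than captured from a real system, and taking the maximum reading per chip is a design choice of this sketch.

```python
import json
from typing import Dict, Optional


def max_temp(chip: dict) -> Optional[float]:
    """Highest temp*_input reading found under one chip entry."""
    readings = [
        value
        for feature in chip.values() if isinstance(feature, dict)
        for key, value in feature.items()
        if key.startswith("temp") and key.endswith("_input")
    ]
    return max(readings) if readings else None


def read_cpu_and_pch(sensors_json: str) -> Dict[str, Optional[float]]:
    """Extract CPU and PCH temperatures from `sensors -j` output."""
    temps: Dict[str, Optional[float]] = {"cpu": None, "pch": None}
    for chip_name, chip in json.loads(sensors_json).items():
        if chip_name.startswith("coretemp"):
            temps["cpu"] = max_temp(chip)
        elif chip_name.startswith("pch_"):
            temps["pch"] = max_temp(chip)
    return temps


sample = json.dumps({
    "coretemp-isa-0000": {
        "Adapter": "ISA adapter",
        "Package id 0": {"temp1_input": 52.0, "temp1_max": 100.0},
        "Core 0": {"temp2_input": 49.0},
    },
    "pch_cannonlake-virtual-0": {
        "Adapter": "Virtual device",
        "temp1": {"temp1_input": 58.0},
    },
})
temps = read_cpu_and_pch(sample)
```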

---

## API

A REST API is available at `/api/v1/`. It is documented via OpenAPI at `/openapi.json` and browsable at `/api` in the dashboard. Version displayed: 0.1.0 (API version tracked independently from app version).

Key endpoints:
- `GET /api/v1/drives` — list all drives with current status.
- `GET /api/v1/drives/{drive_id}` — single drive detail.
- `PATCH /api/v1/drives/{drive_id}` — update drive metadata (e.g., location label).
- `POST /api/v1/drives/{drive_id}/smart/start` — start a SMART test.
- `POST /api/v1/drives/{drive_id}/smart/cancel` — cancel a SMART test.
- `POST /api/v1/burnin/start` — start a burn-in job.
- `POST /api/v1/burnin/{job_id}/cancel` — cancel a burn-in job.
- `GET /sse/drives` — Server-Sent Events stream powering the real-time dashboard UI. Also emits `system-sensors` (CPU/PCH temps, thermal pressure) and `job-alert` (browser push notification) events.
- `GET /health` — health check endpoint.
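
A client for the endpoints listed above needs little more than a request builder. The sketch below only constructs requests and does not send them. `BASE_URL`, the helper name `api_request`, and the job ID `42` are placeholders, and request or response bodies are deliberately left out because their shape is not specified here.

```python
import urllib.request

BASE_URL = "http://localhost:8080"  # hypothetical host and port


def api_request(method: str, path: str) -> urllib.request.Request:
    """Build (but do not send) a request against the burn-in API."""
    return urllib.request.Request(f"{BASE_URL}{path}", method=method)


list_drives = api_request("GET", "/api/v1/drives")
cancel_job = api_request("POST", "/api/v1/burnin/42/cancel")  # 42 is a made-up job_id
health = api_request("GET", "/health")
```

Passing one of these objects to `urllib.request.urlopen` would perform the call. The schema at `/openapi.json` is the place to check for required bodies and response fields.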

The API makes this app a strong candidate for MCP server integration, allowing an AI assistant to query drive status, start tests, or receive alerts conversationally.

---

## Deployment

Docker Compose. Minimum viable setup:

```bash
git clone https://github.com/yourusername/truenas-burnin
cd truenas-burnin
cp .env.example .env
# Edit .env for system-level settings (TrueNAS URL, poll interval, etc.)
docker compose up -d
```

Navigate to `http://your-vm-ip:port` and complete SSH and SMTP configuration in Settings.

All other configuration is done through the Settings UI — no manual file editing required beyond `.env` for system-level values.

---

## TrueNAS Compatibility

Tested and confirmed working against **TrueNAS SCALE 25.10.2.1 (Goldeye)**. Key compatibility notes:

- SCALE 25.10 removed `POST /api/v2.0/smart/test` — SSH is required for all SMART operations.
- Drive temperatures are not included in `GET /api/v2.0/disk` on SCALE — use `POST /api/v2.0/disk/temperatures` instead.
- TrueNAS SCALE is Linux/Debian-based. Device names are `sda`, `sdb`, etc. (not `ada0`/`da0` as on CORE/FreeBSD).
- `lm-sensors` is available on SCALE — `sensors -j` returns CPU (`coretemp`) and PCH (`pch_*`) temperatures.
- `badblocks` and `smartctl` are present at standard paths.

## mock-truenas

A companion Docker service (`mock-truenas`) that simulates the TrueNAS API for UI development and testing without real hardware. It mocks drive discovery, SMART test responses, and badblocks progress. Used exclusively for development — not deployed in production. Disabled (commented out) in the production `docker-compose.yml`.

---

## Version

- App version: **1.0.0-39** (displayed in header next to the title, and in Settings).
- Update check in Settings queries Forgejo releases API (`git.hellocomputer.xyz`).
- API version tracked separately, currently **0.1.0**.
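
One detail an update check has to get right is comparing versions numerically, since `1.0.0-39` sorts before `1.0.0-8` as plain text. A minimal sketch, assuming version strings of the form `MAJOR.MINOR.PATCH-BUILD` with an optional leading `v`, and treating the suffix as a build counter (a missing suffix counts as 0). The function names are hypothetical and the Forgejo API call itself is not shown.

```python
import re
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int, int]:
    """Split '1.0.0-39' into (1, 0, 0, 39); a missing build suffix counts as 0."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)(?:-(\d+))?", version.strip())
    if match is None:
        raise ValueError(f"unrecognized version: {version!r}")
    major, minor, patch, build = match.groups(default="0")
    return int(major), int(minor), int(patch), int(build)


def update_available(current: str, latest: str) -> bool:
    """True when the latest release is newer than the running version."""
    return parse_version(latest) > parse_version(current)


newer = update_available("1.0.0-8", "1.0.0-39")
naive_string_compare = "1.0.0-39" > "1.0.0-8"  # misleading: compares characters
```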

---

## Out of Scope (v1.0)

- Scheduled or automated burn-in triggering.
- Non-destructive badblocks mode (read-only surface scan).
- Multi-TrueNAS support (single host only).
- User authentication / login wall (single-user, self-hosted, IP allowlist is sufficient).
- Mobile-optimized UI (desktop dashboard only).