# TrueNAS Burn-In — Project Specification

**Version:** 0.5.0
**Status:** Active Development
**Audience:** Public / Open Source

---

## Overview

TrueNAS Burn-In is a self-hosted web dashboard that runs on a separate machine or VM and connects to a TrueNAS system via SSH to automate and monitor the drive burn-in process. It is designed for users who want to validate new hard drives before adding them to a ZFS pool, where reliability is non-negotiable.

The app is not a TrueNAS plugin and does not run on TrueNAS itself. It connects remotely over SSH to issue `smartctl` and `badblocks` commands, polls for results, and presents everything through a dark-themed real-time dashboard. It is deployed via Docker Compose and configured through a Settings UI and a `.env` file.

---

## Core Philosophy

- Drives going into a ZFS pool must be rock solid. The app's job is to surface any doubt about a drive before it earns a slot in the pool.
- Burn-in is always triggered manually. There is no scheduling or automation.
- Simplicity over features. The README and Settings UI should be sufficient for any technically capable user to be up and running without hand-holding.
- Recommend safe defaults. Warn loudly when users push limits (too many parallel jobs, destructive operations, high temperatures).

---

## Test Sequence

Every drive goes through the following stages in order. A failure at any stage stops that drive immediately.

### Stage 1 — Short SMART Test

```
smartctl -t short -d sat /dev/sdX
```

The app polls for completion via:

```
smartctl -a -d sat /dev/sdX | grep -i remaining
```

Expected duration: ~2 minutes. If the test fails or reports any critical attribute violation, the drive is marked FAILED and no further tests run.

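
The polling step above can be sketched in Python as follows. This is a minimal illustration, not the app's actual parser: it assumes the ATA-style progress line that `smartctl -a` prints while a self-test is running (for example `90% of test remaining.`), and the function names are hypothetical.

```python
import re
from typing import Optional

# Matches the progress line smartctl prints during a self-test,
# e.g. "90% of test remaining." (assumed ATA output format).
REMAINING_RE = re.compile(r"(\d{1,3})%\s+of\s+test\s+remaining", re.IGNORECASE)


def parse_remaining(output: str) -> Optional[int]:
    """Return the percentage of the test still remaining, or None if absent."""
    match = REMAINING_RE.search(output)
    return int(match.group(1)) if match else None


def percent_complete(output: str) -> Optional[int]:
    """Convert 'remaining' into the completion percentage shown in the UI."""
    remaining = parse_remaining(output)
    return None if remaining is None else 100 - remaining
```

A missing progress line (a `None` result) would normally mean that no test is running any more, at which point the self-test log has to be checked for the actual result.
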

### Stage 2 — Long SMART Test

```
smartctl -t long -d sat /dev/sdX
```

Expected duration: varies by drive size (typically 3–6 hours for 8–12 TB drives). The polling approach and failure behavior are the same as in Stage 1.

### Stage 3 — Surface Scan (Badblocks, Destructive)

```
badblocks -wsv -b 4096 -p 1 /dev/sdX
```

This is a **destructive write test**. The UI must display a prominent warning before this stage begins, and again on the Settings page where the behavior is documented. The `-w` flag overwrites all data on the drive. This is intentional: these are new drives being validated before pool use.

**Failure threshold:** Finding 2 or more bad blocks triggers an immediate abort and FAILED status. The threshold should be configurable in Settings (default: 2).

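
The failure threshold for Stage 3 could be checked roughly like this. The sketch assumes that `badblocks`, when run without `-o`, writes each bad block number on its own line to standard output, and that the app captures those lines; the helper names are hypothetical.

```python
from typing import Iterable


def count_bad_blocks(stdout_lines: Iterable[str]) -> int:
    """Count lines that consist solely of a block number."""
    return sum(1 for line in stdout_lines if line.strip().isdigit())


def should_abort(bad_blocks: int, threshold: int = 2) -> bool:
    """True once the configured threshold (default 2) is reached."""
    return bad_blocks >= threshold
```
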

---

## SMART Attributes to Monitor

The following attributes are checked after each SMART test and continuously during the burn-in run. A non-zero raw value triggers the action listed in the Threshold column: either a warning (WARN) or an immediate failure (FAIL).

| ID  | Attribute                  | Threshold    | Notes                                      |
|-----|----------------------------|--------------|--------------------------------------------|
| 5   | Reallocated_Sector_Ct      | > 0 = FAIL   | Any reallocation is disqualifying for ZFS  |
| 10  | Spin_Retry_Count           | > 0 = WARN   | Mechanical stress indicator                |
| 188 | Command_Timeout            | > 0 = WARN   | Drive not responding to commands           |
| 197 | Current_Pending_Sector     | > 0 = FAIL   | Sectors waiting to be reallocated          |
| 198 | Offline_Uncorrectable      | > 0 = FAIL   | Unrecoverable read errors                  |
| 199 | UDMA_CRC_Error_Count       | > 0 = WARN   | Likely cable/controller, flag for review   |

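
One way to apply the table above is a small lookup keyed by attribute ID. This is a sketch under the assumption that raw values have already been parsed into a `{id: raw_value}` mapping; the `RULES` table mirrors the spec, while the function name and return shape are illustrative.

```python
from typing import Dict, List, Tuple

# Attribute ID -> (name, severity when the raw value is non-zero).
RULES = {
    5: ("Reallocated_Sector_Ct", "FAIL"),
    10: ("Spin_Retry_Count", "WARN"),
    188: ("Command_Timeout", "WARN"),
    197: ("Current_Pending_Sector", "FAIL"),
    198: ("Offline_Uncorrectable", "FAIL"),
    199: ("UDMA_CRC_Error_Count", "WARN"),
}


def evaluate(raw_values: Dict[int, int]) -> Tuple[str, List[str]]:
    """Return the overall verdict (OK, WARN, or FAIL) and the findings behind it."""
    verdict = "OK"
    findings = []
    for attr_id, (name, severity) in RULES.items():
        raw = raw_values.get(attr_id, 0)
        if raw > 0:
            findings.append(f"{name}={raw} ({severity})")
            if severity == "FAIL":
                verdict = "FAIL"      # a failure always wins
            elif verdict == "OK":
                verdict = "WARN"      # a warning never downgrades a failure
    return verdict, findings
```
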

---

## Failure Behavior

When a drive fails at any stage:

1. All remaining tests for that drive are immediately cancelled.
2. The drive is marked `FAILED` in the UI with the specific failure reason (e.g., `FAILED (SURFACE VALIDATE)`, `FAILED (REALLOCATED SECTORS)`).
3. An alert is fired immediately via whichever notification channels are enabled in Settings (email, webhook, or both at once).
4. The failed drive's row is visually distinct in the dashboard and cannot be accidentally re-queued without an explicit reset action.

A **Reset** action clears the test state for a drive so it can be re-queued. It does not cancel in-progress tests; the Cancel button does that. Reset is only available for drives in a finished state (passed, failed, or interrupted).

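
The Reset rule above amounts to a small state gate. The sketch below uses hypothetical state names, and it assumes that only idle drives may be queued; the spec only states this explicitly for failed drives.

```python
from enum import Enum


class DriveState(str, Enum):
    IDLE = "idle"
    RUNNING = "running"
    PASSED = "passed"
    FAILED = "failed"
    INTERRUPTED = "interrupted"


# Reset is only offered for finished states, never for a running test.
RESETTABLE = frozenset(
    {DriveState.PASSED, DriveState.FAILED, DriveState.INTERRUPTED}
)


def can_queue(state: DriveState) -> bool:
    """Assumption: a drive must be idle before a new test is queued."""
    return state is DriveState.IDLE


def reset(state: DriveState) -> DriveState:
    """Clear a finished drive's test state so it can be re-queued."""
    if state not in RESETTABLE:
        raise ValueError(f"cannot reset a drive in state '{state.value}'")
    return DriveState.IDLE
```
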

---

## UI

### Dashboard (Main View)

- **Stats bar:** Total drives, Running, Failed, Passed, Idle counts.
- **Filter chips:** All / Running / Failed / Passed / Idle — filters the table below.
- **Drive table columns:** Drive (device name + model), Serial, Size, Temp, Health, Short SMART, Long SMART, Burn-In, Actions.
- **Temperature display:** Color-coded. Green ≤ 45°C, Yellow 46–54°C, Red ≥ 55°C. Thresholds configurable in Settings.
- **Running tests:** Show an animated progress bar with percentage and elapsed time instead of a static badge.
- **Actions per drive:** Short, Long, Burn-In buttons. Cancel button replaces Start when a test is running.
- **Row click:** Opens the Log Drawer for that drive.

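
The temperature color coding in the list above reduces to two comparisons. A minimal sketch, using the default thresholds from this spec (warning at 46°C, critical at 55°C) and a hypothetical function name:

```python
def temperature_color(temp_c: float, warn: float = 46, critical: float = 55) -> str:
    """Map a drive temperature in Celsius to the dashboard color."""
    if temp_c >= critical:
        return "red"
    if temp_c >= warn:
        return "yellow"
    return "green"
```

Passing the thresholds as parameters keeps the function in step with whatever values are configured in Settings.
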

### Log Drawer

Slides up from the bottom of the page when a drive row is clicked. Does not navigate away; the table remains visible and scrollable above.

Three tabs:
- **badblocks** — live tail of badblocks stdout, including error lines with sector numbers highlighted in red.
- **SMART** — output of the last smartctl run for this drive, with monitored attribute values highlighted.
- **Events** — chronological timeline of everything that happened to this drive (test started, test passed, failure detected, alert sent, etc.).

Features:
- Auto-scroll toggle (on by default).
- Blinking cursor on the active output line of a running test.
- Close button or click the same row again to dismiss.
- Failed drives show error lines in red with exact bad block sector numbers.

### History Page

Per-drive history. Each drive (identified by serial number) has a log of every burn-in run ever performed, with timestamps, results, and duration. History is tracked per individual drive, not per session.

### Audit Page

Application-level event log. Records: test started, test cancelled, settings changed, alert sent, container restarted, SSH connection lost/restored. Useful for debugging and for open source users troubleshooting their setup.

### Stats Page

Aggregate statistics across all drives and all time: pass rate, average test duration by drive size, and failure breakdown by failure type.

### Settings Page

Divided into sections:

**EMAIL (SMTP)**
- Host, Mode (STARTTLS/SSL/plain), Port, Timeout, Username, Password, From, To.
- Test Connection button.
- Enable/disable toggle.

**WEBHOOK**
- Single URL field. POST JSON payload on `burnin_passed` and `burnin_failed` events.
- Compatible with ntfy.sh, Slack, Discord, n8n, and any generic HTTP POST receiver.
- Leave blank to disable.

**NOTIFICATIONS**
- Daily Report toggle (sends full drive status email at a configured hour).
- Alert on Failure toggle (immediate; fires both email and webhook if both are configured).
- Alert on Pass toggle.

**BURN-IN BEHAVIOR**
- Max Parallel Burn-Ins (default: 2, max: 60).
- Warning displayed inline when set above 8: "Running many simultaneous surface scans may saturate your storage controller and produce unreliable results. Recommended: 2–4."
- Bad block failure threshold (default: 2).
- Stuck job threshold in hours (default: 24; jobs running longer than this are auto-marked Unknown).

**TEMPERATURE**
- Warning threshold (default: 46°C).
- Critical threshold (default: 55°C).

**SSH**
- TrueNAS host/IP.
- Port (default: 22).
- Username.
- Authentication: SSH key (paste or upload) or password.
- Test Connection button.

**SYSTEM** *(set in `.env`; restart required to change)*
- TrueNAS API URL.
- Verify TLS toggle.
- Poll interval (default: 12s).
- Stale threshold (default: 45s).
- IP allowlist.
- Log level (DEBUG / INFO / WARN / ERROR).

**VERSION & UPDATES**
- Displays current version (starting at 0.5.0).
- "Check for Updates" button: queries the GitHub releases API and shows the latest version with a link if an update is available.

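
The Burn-In Behavior limits listed above could be enforced along these lines. This is a sketch, not the app's validation code: the lower bound of 1 is an assumption (the spec gives only the default and the maximum), and both function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple

MAX_PARALLEL_LIMIT = 60   # hard maximum from the spec
PARALLEL_WARN_ABOVE = 8   # inline warning is shown above this value
PARALLEL_WARNING = (
    "Running many simultaneous surface scans may saturate your storage "
    "controller and produce unreliable results. Recommended: 2–4."
)


def validate_max_parallel(value: int) -> Tuple[int, Optional[str]]:
    """Return (value, warning or None); reject values outside 1..60."""
    if not 1 <= value <= MAX_PARALLEL_LIMIT:
        raise ValueError(f"max parallel burn-ins must be 1..{MAX_PARALLEL_LIMIT}")
    warning = PARALLEL_WARNING if value > PARALLEL_WARN_ABOVE else None
    return value, warning


def is_stuck(started_at: datetime, now: datetime, threshold_hours: int = 24) -> bool:
    """True when a job has run longer than the stuck-job threshold."""
    return now - started_at > timedelta(hours=threshold_hours)
```
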

---

## Data Persistence

**SQLite**: single file, zero config, atomic writes. No data loss on container restart.

On restart, any drive that was in a `running` state is automatically transitioned to `interrupted`. The user sees "INTERRUPTED" in the burn-in column and must manually reset and re-queue the drive. The partial log up to the point of interruption is preserved and viewable in the Log Drawer.

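
The restart transition described above can be expressed as a single `UPDATE` run at startup. The sketch below uses Python's standard `sqlite3` module; the `burnin_jobs` table and its `state` column are hypothetical names chosen for illustration, not the app's real schema.

```python
import sqlite3


def mark_interrupted(conn: sqlite3.Connection) -> int:
    """Flip every job left in 'running' to 'interrupted'; return how many changed."""
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "UPDATE burnin_jobs SET state = 'interrupted' WHERE state = 'running'"
        )
    return cursor.rowcount
```

Running this once before the poller starts means the dashboard never shows a job as running that the app is no longer tracking.
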

Drive location labels persist in SQLite tied to serial number, so a drive's label survives container restarts and reappears automatically when the drive is detected again.

---

## Notifications

### Email

Standard SMTP. Fires on: burn-in failure (immediate), burn-in pass (if enabled), daily report (scheduled).

Failure email includes: drive name, serial number, size, failure stage, failure reason, bad block count (if applicable), SMART attribute snapshot, timestamp.

### Webhook

Single HTTP POST to configured URL with JSON body:

```json
{
  "event": "burnin_failed",
  "drive": "sda",
  "serial": "WDZ1A002",
  "size": "12 TB",
  "failure_reason": "SURFACE VALIDATE",
  "bad_blocks": 2,
  "timestamp": "2025-01-15T03:21:04Z"
}
```

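
A payload of the shape shown above could be built and sent with the standard library alone. In this sketch the function names are hypothetical, and omitting `failure_reason` and `bad_blocks` from pass events is an assumption; the spec only shows the failure payload.

```python
import json
import urllib.request
from datetime import datetime, timezone
from typing import Optional


def build_payload(event: str, drive: str, serial: str, size: str,
                  failure_reason: Optional[str] = None,
                  bad_blocks: Optional[int] = None,
                  now: Optional[datetime] = None) -> dict:
    """Assemble the JSON body; `now` must be a UTC datetime if supplied."""
    now = now or datetime.now(timezone.utc)
    payload = {"event": event, "drive": drive, "serial": serial, "size": size}
    if failure_reason is not None:
        payload["failure_reason"] = failure_reason
    if bad_blocks is not None:
        payload["bad_blocks"] = bad_blocks
    payload["timestamp"] = now.strftime("%Y-%m-%dT%H:%M:%SZ")
    return payload


def post_webhook(url: str, payload: dict, timeout: float = 10.0) -> int:
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```
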

Compatible with ntfy.sh, Slack incoming webhooks, Discord webhooks, and n8n HTTP trigger nodes.

Both email and webhook fire simultaneously when both are configured and enabled. The user controls each independently via Settings toggles.

---

## SSH Architecture

The app connects to TrueNAS over SSH from the host running the Docker container. It does not use the TrueNAS web API for drive operations; all `smartctl` and `badblocks` commands are issued directly over SSH.

Connection details are configured in Settings (not `.env`). Supports:
- Password authentication.
- SSH key authentication (key pasted or uploaded in Settings UI).
- Custom port.
- Test Connection button validates credentials before saving.

On SSH disconnection mid-test, the test process on TrueNAS may continue running: a dropped SSH session does not kill the remote process if it was launched with `nohup` or similar. The app marks the drive as `interrupted` in its own state, attempts to reconnect, and resumes polling if the process is still running. If the remote process is gone, the drive stays `interrupted`.

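
The disconnect handling above depends on two things: launching the remote command so that it outlives the SSH session, and deciding the drive's state after reconnecting. The sketch below illustrates both under stated assumptions: the log path and function names are hypothetical, the command string is passed to the remote shell as-is, and how the app actually checks whether the process is alive is not specified here.

```python
import shlex


def detached_command(command: str, log_path: str) -> str:
    """Wrap a shell command so it survives an SSH disconnect.

    Output goes to a log file, stdin is detached, and the shell prints
    the background PID so the caller can poll for it later.
    """
    return f"nohup {command} > {shlex.quote(log_path)} 2>&1 < /dev/null & echo $!"


def state_after_reconnect(remote_process_alive: bool) -> str:
    """Resume polling if the remote job survived; otherwise stay interrupted."""
    return "running" if remote_process_alive else "interrupted"
```
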

---

## API

A REST API is available at `/api/v1/`. It is documented via OpenAPI at `/openapi.json` and browsable at `/api` in the dashboard. The API reports version 0.1.0, which is tracked independently of the app version.

Key endpoints:
- `GET /api/v1/drives` — list all drives with current status.
- `GET /api/v1/drives/{drive_id}` — single drive detail.
- `PATCH /api/v1/drives/{drive_id}` — update drive metadata (e.g., location label).
- `POST /api/v1/drives/{drive_id}/smart/start` — start a SMART test.
- `POST /api/v1/drives/{drive_id}/smart/cancel` — cancel a SMART test.
- `POST /api/v1/burnin/start` — start a burn-in job.
- `POST /api/v1/burnin/{job_id}/cancel` — cancel a burn-in job.
- `GET /sse/drives` — Server-Sent Events stream powering the real-time dashboard UI.
- `GET /health` — health check endpoint.

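
A client of the `GET /sse/drives` stream listed above needs to split the response into events. The following is a simplified parser for the standard Server-Sent Events wire format (`event:` and `data:` fields, with a blank line ending each event); the event names and payloads this app actually sends are not specified here, so the ones in any usage are placeholders.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event_name, data) pairs from raw SSE lines."""
    event = "message"          # default event type when none is given
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the event collected so far.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue           # comment line, often used as a keep-alive
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)
```
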

The API makes this app a strong candidate for MCP server integration, allowing an AI assistant to query drive status, start tests, or receive alerts conversationally.

---

## Deployment

Docker Compose. Minimum viable setup:

```bash
git clone https://github.com/yourusername/truenas-burnin
cd truenas-burnin
cp .env.example .env
# Edit .env for system-level settings (TrueNAS URL, poll interval, etc.)
docker compose up -d
```

Navigate to `http://your-vm-ip:port` and complete SSH and SMTP configuration in Settings.

All other configuration is done through the Settings UI. No manual file editing is required beyond `.env` for system-level values.

---

## mock-truenas

A companion Docker service (`mock-truenas`) that simulates the TrueNAS API for UI development and testing without real hardware. It mocks drive discovery, SMART test responses, and badblocks progress. Used exclusively for development; not deployed in production.

### Testing on Real TrueNAS (v1.0 Milestone Plan)

To validate against real hardware:

1. Switch `TRUENAS_URL` in `.env` from `http://mock-truenas:8000` to your real TrueNAS IP/hostname.
2. Ensure SSH is enabled on TrueNAS (System → Services → SSH).
3. Configure SSH credentials in Settings and use Test Connection to verify.
4. Start with a single idle drive and run Short SMART only first.
5. Verify the log drawer shows real smartctl output.
6. If successful, proceed to Long SMART, then a full burn-in on a drive you're comfortable wiping.
7. Confirm an alert email is received on completion.
8. Scale to 2–4 drives simultaneously and monitor system resource warnings.

**v1.0 is considered production-ready when:** the app runs reliably on a real TrueNAS system with 10 simultaneous drives, a failure alert email is received correctly, and a passing drive's history is preserved across a container restart.

---

## Version

- App version starts at **0.5.0**.
- Displayed on the dashboard landing page header and in Settings.
- Update check in Settings queries GitHub releases API.
- API version tracked separately, currently **0.1.0**.

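
The update check above comes down to comparing the running version with the latest release tag. A minimal sketch, assuming release tags are plain dotted numbers with an optional leading `v` (for example the `tag_name` field returned by GitHub's latest-release endpoint); pre-release suffixes are not handled, and the function names are hypothetical.

```python
from typing import Tuple


def parse_version(tag: str) -> Tuple[int, ...]:
    """Turn '0.5.0' or 'v0.5.0' into a tuple of integers for comparison."""
    return tuple(int(part) for part in tag.lstrip("vV").split("."))


def update_available(current: str, latest_tag: str) -> bool:
    """True when the latest release is newer than the running version."""
    return parse_version(latest_tag) > parse_version(current)
```

Comparing integer tuples rather than strings avoids ranking `0.10.0` below `0.9.0`.
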

---

## Out of Scope (v1.0)

- Scheduled or automated burn-in triggering.
- Non-destructive badblocks mode (read-only surface scan).
- Multi-TrueNAS support (single host only).
- User authentication / login wall (single-user, self-hosted, IP allowlist is sufficient).
- Mobile-optimized UI (desktop dashboard only).