
# TrueNAS Burn-In — Project Specification

**Version:** 0.5.0
**Status:** Active Development
**Audience:** Public / Open Source

---

## Overview

TrueNAS Burn-In is a self-hosted web dashboard that runs on a separate machine or VM and connects to a TrueNAS system via SSH to automate and monitor the drive burn-in process. It is designed for users who want to validate new hard drives before adding them to a ZFS pool — where reliability is non-negotiable.

The app is not a TrueNAS plugin and does not run on TrueNAS itself. It connects remotely over SSH to issue smartctl and badblocks commands, polls the results, and presents everything through a dark-themed real-time dashboard. It is deployed via Docker Compose and configured through the Settings UI and a `.env` file.

---

## Core Philosophy

- Drives going into a ZFS pool must be rock solid. The app's job is to surface any doubt about a drive before it earns a slot in the pool.
- Burn-in is always started manually. The app automates running and monitoring the stages, but it never schedules or starts a run on its own.
- Simplicity over features. The README and Settings UI should be sufficient for any technically capable user to be up and running without hand-holding.
- Recommend safe defaults. Warn loudly when users push limits (too many parallel jobs, destructive operations, high temperatures).

---

## Test Sequence

Every drive goes through the following stages in order. A failure at any stage stops that drive immediately.

### Stage 1 — Short SMART Test
```
smartctl -t short -d sat /dev/sdX
```
Polls for completion via:
```
smartctl -a -d sat /dev/sdX | grep -i remaining
```
Expected duration: ~2 minutes. If the test fails or reports any critical attribute violation, the drive is marked FAILED and no further tests run.
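
The polling step can be sketched in Python as below. This is a minimal illustration, not the project's implementation: it assumes the `smartctl -a` output contains a line such as `90% of test remaining.` while a self-test is in progress, and the function names are hypothetical.

```python
import re
from typing import Optional

# Matches the progress line smartctl prints while a self-test is running,
# e.g. "90% of test remaining."
_REMAINING_RE = re.compile(r"(\d{1,3})%\s+of\s+test\s+remaining", re.IGNORECASE)


def parse_remaining_percent(smartctl_output: str) -> Optional[int]:
    """Return the percent of the self-test still remaining, or None when
    no progress line is present (test finished or never started)."""
    match = _REMAINING_RE.search(smartctl_output)
    return int(match.group(1)) if match else None


def percent_complete(smartctl_output: str) -> Optional[int]:
    """Convert 'remaining' into the 'done' percentage a progress bar needs."""
    remaining = parse_remaining_percent(smartctl_output)
    return None if remaining is None else 100 - remaining
```

A `None` result only means no progress line was found; the caller would still need to inspect the self-test log to tell a pass from a failure.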

### Stage 2 — Long SMART Test
```
smartctl -t long -d sat /dev/sdX
```
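Expected duration: varies by drive size and speed. A long test reads the entire surface, so 8–12 TB drives commonly need well over 10 hours; `smartctl -c` reports the drive's own estimate as the recommended polling time for the extended self-test. Same polling approach. Same failure behavior.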

### Stage 3 — Surface Scan (Badblocks, Destructive)
```
badblocks -wsv -b 4096 -p 1 /dev/sdX
```
This is a **destructive write test**. The UI must display a prominent warning before this stage begins, and again in the Settings page where the behavior is documented. The `-w` flag overwrites all data on the drive. This is intentional — these are new drives being validated before pool use.

**Failure threshold:** 2 or more bad blocks found triggers immediate abort and FAILED status. The threshold should be configurable in Settings (default: 2).
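
A rough Python sketch of the threshold check follows. It assumes the progress line printed by `badblocks -s -v` looks like `12.34% done, 1:23:45 elapsed. (0/0/0 errors)`, where the three counters are read, write, and corruption errors, and it treats their sum as the bad block count. Both the format and that mapping are assumptions to verify against real output; the function names are hypothetical.

```python
import re
from typing import Optional, Tuple

# Assumed progress format from `badblocks -s -v`:
#   " 12.34% done, 1:23:45 elapsed. (0/0/0 errors)"
_PROGRESS_RE = re.compile(
    r"(\d+(?:\.\d+)?)% done, ([\d:]+) elapsed\. \((\d+)/(\d+)/(\d+) errors\)"
)


def parse_progress(line: str) -> Optional[Tuple[float, int]]:
    """Return (percent_done, total_errors) or None if the line is not
    a progress line."""
    match = _PROGRESS_RE.search(line)
    if match is None:
        return None
    percent = float(match.group(1))
    # Sum read/write/corruption error counters into one figure.
    errors = sum(int(match.group(i)) for i in (3, 4, 5))
    return percent, errors


def should_abort(bad_blocks: int, threshold: int = 2) -> bool:
    """True once the count reaches the configured threshold (default 2)."""
    return bad_blocks >= threshold
```

With the default threshold, a single bad block is tolerated and the second one aborts the run.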

---

## SMART Attributes to Monitor

The following attributes are checked after each SMART test and continuously during the burn-in run. Each rule applies to the attribute's raw value: any non-zero value triggers the outcome listed in the table, either a warning (WARN) or an immediate failure (FAIL).

| ID  | Attribute                  | Threshold    | Notes                                      |
|-----|----------------------------|--------------|--------------------------------------------|
| 5   | Reallocated_Sector_Ct      | > 0 = FAIL   | Any reallocation is disqualifying for ZFS  |
| 10  | Spin_Retry_Count           | > 0 = WARN   | Mechanical stress indicator                |
| 188 | Command_Timeout            | > 0 = WARN   | Drive not responding to commands           |
| 197 | Current_Pending_Sector     | > 0 = FAIL   | Sectors waiting to be reallocated          |
| 198 | Offline_Uncorrectable      | > 0 = FAIL   | Unrecoverable read errors                  |
| 199 | UDMA_CRC_Error_Count       | > 0 = WARN   | Likely cable/controller, flag for review   |
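
The table above can be expressed as a small rule map. The sketch below is illustrative only: it assumes raw attribute values have already been parsed into a dict keyed by attribute ID, and the `evaluate` function is a hypothetical name.

```python
from typing import Dict, List, Tuple

# (attribute name, outcome when the raw value is > 0), keyed by SMART ID.
RULES: Dict[int, Tuple[str, str]] = {
    5: ("Reallocated_Sector_Ct", "FAIL"),
    10: ("Spin_Retry_Count", "WARN"),
    188: ("Command_Timeout", "WARN"),
    197: ("Current_Pending_Sector", "FAIL"),
    198: ("Offline_Uncorrectable", "FAIL"),
    199: ("UDMA_CRC_Error_Count", "WARN"),
}


def evaluate(raw_values: Dict[int, int]) -> Tuple[str, List[str]]:
    """Return (verdict, findings). Verdict is "OK", "WARN", or "FAIL";
    FAIL always wins over WARN. Missing attributes count as zero."""
    verdict = "OK"
    findings: List[str] = []
    for attr_id, (name, severity) in RULES.items():
        raw = raw_values.get(attr_id, 0)
        if raw > 0:
            findings.append(f"{severity}: {name} = {raw}")
            # Escalate to FAIL, or from OK to WARN; never downgrade.
            if severity == "FAIL" or verdict == "OK":
                verdict = severity
    return verdict, findings
```

Keeping the rules in data rather than in branching code makes it straightforward to add an attribute or change an outcome later.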

---

## Failure Behavior

When a drive fails at any stage:

1. All remaining tests for that drive are immediately cancelled.
2. The drive is marked `FAILED` in the UI with the specific failure reason (e.g., `FAILED (SURFACE VALIDATE)`, `FAILED (REALLOCATED SECTORS)`).
3. An alert is fired immediately via whichever notification channels are enabled in Settings (email and/or webhook — both can fire simultaneously).
4. The failed drive's row is visually distinct in the dashboard and cannot be accidentally re-queued without an explicit reset action.

A **Reset** action clears the test state for a drive so it can be re-queued. It does not cancel in-progress tests — the Cancel button does that. Reset is only available on completed drives (passed, failed, or interrupted).

---

## UI

### Dashboard (Main View)

- **Stats bar:** Total drives, Running, Failed, Passed, Idle counts.
- **Filter chips:** All / Running / Failed / Passed / Idle — filters the table below.
- **Drive table columns:** Drive (device name + model), Serial, Size, Temp, Health, Short SMART, Long SMART, Burn-In, Actions.
- **Temperature display:** Color-coded. Green ≤ 45°C, Yellow 46–54°C, Red ≥ 55°C. Thresholds configurable in Settings.
- **Running tests:** Show an animated progress bar with percentage and elapsed time instead of a static badge.
- **Actions per drive:** Short, Long, Burn-In buttons. Cancel button replaces Start when a test is running.
- **Row click:** Opens the Log Drawer for that drive.
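
The temperature color bands can be sketched as a single function. This assumes whole-degree Celsius readings and uses the default thresholds from this spec; the function name is hypothetical.

```python
def temperature_color(temp_c: int, warn_at: int = 46, critical_at: int = 55) -> str:
    """Map a drive temperature to a dashboard color.

    Defaults follow the spec: green up to 45, yellow 46-54, red from 55.
    """
    if temp_c >= critical_at:
        return "red"
    if temp_c >= warn_at:
        return "yellow"
    return "green"
```

Checking the critical threshold first keeps the function correct even if a user sets the two thresholds close together.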

### Log Drawer

Slides up from the bottom of the page when a drive row is clicked. Does not navigate away — the table remains visible and scrollable above.

Three tabs:
- **badblocks** — live tail of badblocks output (both stdout and stderr, since badblocks prints its `-s` progress to stderr), including error lines with sector numbers highlighted in red.
- **SMART** — output of the last smartctl run for this drive, with monitored attribute values highlighted.
- **Events** — chronological timeline of everything that happened to this drive (test started, test passed, failure detected, alert sent, etc.).

Features:
- Auto-scroll toggle (on by default).
- Blinking cursor on the active output line of a running test.
- Close button or click the same row again to dismiss.
- Failed drives show error lines in red with exact bad block sector numbers.

### History Page

Per-drive history. Each drive (identified by serial number) has a log of every burn-in run ever performed, with timestamps, results, and duration. Not per-session — per individual drive.

### Audit Page

Application-level event log. Records: test started, test cancelled, settings changed, alert sent, container restarted, SSH connection lost/restored. Useful for debugging and for open source users troubleshooting their setup.

### Stats Page

Aggregate statistics across all drives and all time. Pass rate, average test duration by drive size, failure breakdown by failure type.

### Settings Page

Divided into sections:

**EMAIL (SMTP)**
- Host, Mode (STARTTLS/SSL/plain), Port, Timeout, Username, Password, From, To.
- Test Connection button.
- Enable/disable toggle.

**WEBHOOK**
- Single URL field. POST JSON payload on `burnin_passed` and `burnin_failed` events.
- Compatible with ntfy.sh, Slack, Discord, n8n, and any generic HTTP POST receiver.
- Leave blank to disable.

**NOTIFICATIONS**
- Daily Report toggle (sends full drive status email at a configured hour).
- Alert on Failure toggle (immediate — fires both email and webhook if both configured).
- Alert on Pass toggle.

**BURN-IN BEHAVIOR**
- Max Parallel Burn-Ins (default: 2, max: 60).
- Warning displayed inline when set above 8: "Running many simultaneous surface scans may saturate your storage controller and produce unreliable results. Recommended: 2–4."
- Bad block failure threshold (default: 2).
- Stuck job threshold in hours (default: 24 — jobs running longer than this are auto-marked Unknown).
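
As a rough sketch, the validation behind these settings might look like the following. The lower bound of 1 for parallel jobs is an assumption (the spec only states the maximum), and both function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple

MAX_PARALLEL_LIMIT = 60  # hard cap from the spec
WARN_ABOVE = 8           # inline warning is shown above this value
PARALLEL_WARNING = (
    "Running many simultaneous surface scans may saturate your storage "
    "controller and produce unreliable results. Recommended: 2–4."
)


def validate_max_parallel(value: int) -> Tuple[int, Optional[str]]:
    """Return (value, warning_or_None). Raises ValueError outside 1..60.
    The lower bound of 1 is an assumption, not stated in the spec."""
    if not 1 <= value <= MAX_PARALLEL_LIMIT:
        raise ValueError(f"Max Parallel Burn-Ins must be 1..{MAX_PARALLEL_LIMIT}")
    return value, (PARALLEL_WARNING if value > WARN_ABOVE else None)


def is_stuck(started_at: datetime, now: datetime, threshold_hours: int = 24) -> bool:
    """True when a job has run longer than the stuck-job threshold."""
    return now - started_at > timedelta(hours=threshold_hours)
```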

**TEMPERATURE**
- Warning threshold (default: 46°C).
- Critical threshold (default: 55°C).

**SSH**
- TrueNAS host/IP.
- Port (default: 22).
- Username.
- Authentication: SSH key (paste or upload) or password.
- Test Connection button.

**SYSTEM** *(restart required to change — set in .env)*
- TrueNAS API URL.
- Verify TLS toggle.
- Poll interval (default: 12s).
- Stale threshold (default: 45s).
- IP allowlist.
- Log level (DEBUG / INFO / WARN / ERROR).
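
One possible way to load these values at startup is sketched below. Only `TRUENAS_URL` is named elsewhere in this spec; the other variable names (`VERIFY_TLS`, `POLL_INTERVAL_SECONDS`, `STALE_THRESHOLD_SECONDS`, `LOG_LEVEL`) and the `VERIFY_TLS` and `LOG_LEVEL` defaults are assumptions made for illustration. The IP allowlist is omitted because its format is not specified.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class SystemConfig:
    truenas_url: str
    verify_tls: bool
    poll_interval_s: int
    stale_threshold_s: int
    log_level: str


def load_system_config(env: Mapping[str, str] = os.environ) -> SystemConfig:
    """Read system-level settings from the environment.

    TRUENAS_URL is required; the rest fall back to defaults.
    Variable names other than TRUENAS_URL are hypothetical.
    """
    truthy = {"1", "true", "yes", "on"}
    return SystemConfig(
        truenas_url=env["TRUENAS_URL"],
        verify_tls=env.get("VERIFY_TLS", "true").strip().lower() in truthy,
        poll_interval_s=int(env.get("POLL_INTERVAL_SECONDS", "12")),
        stale_threshold_s=int(env.get("STALE_THRESHOLD_SECONDS", "45")),
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
    )
```

Because the dataclass is frozen, the values cannot change at runtime, which matches the "restart required" behavior of this section.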

**VERSION & UPDATES**
- Displays current version (starting at 0.5.0).
- "Check for Updates" button — queries GitHub releases API and shows latest version with a link if an update is available.
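
The comparison behind the update check could be as simple as the sketch below. It assumes release tags are plain `MAJOR.MINOR.PATCH` strings with an optional leading `v` (as returned in the `tag_name` field of GitHub's latest-release response); pre-release suffixes are not handled, and the function names are hypothetical.

```python
from typing import Tuple


def parse_version(tag: str) -> Tuple[int, ...]:
    """Turn "v0.5.1" or "0.5.1" into (0, 5, 1)."""
    return tuple(int(part) for part in tag.strip().lstrip("vV").split("."))


def update_available(current: str, latest_tag: str) -> bool:
    """True when the latest release tag is newer than the running version."""
    return parse_version(latest_tag) > parse_version(current)
```

Comparing integer tuples avoids the string-comparison trap where `"0.10.0"` sorts before `"0.9.9"`.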

---

## Data Persistence

**SQLite** — single file, zero config, atomic writes. No data loss on container restart.

On restart, any drive that was in a `running` state is automatically transitioned to `interrupted`. The user sees "INTERRUPTED" in the burn-in column and must manually reset and re-queue the drive. The partial log up to the point of interruption is preserved and viewable in the Log Drawer.

Drive location labels persist in SQLite tied to serial number, so a drive's label survives container restarts and reappears automatically when the drive is detected again.
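
The restart recovery described above can be sketched with the standard `sqlite3` module. The `drives` table and its columns are a hypothetical schema invented for this example, not the project's actual one.

```python
import sqlite3


def init_schema(conn: sqlite3.Connection) -> None:
    """Create a minimal, hypothetical drives table keyed by serial number."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS drives ("
        " serial TEXT PRIMARY KEY,"
        " state TEXT NOT NULL,"
        " location_label TEXT)"
    )


def mark_running_as_interrupted(conn: sqlite3.Connection) -> int:
    """On startup, flip every 'running' drive to 'interrupted'.

    Returns the number of drives affected. Labels are left untouched.
    """
    with conn:  # commits on success, rolls back on exception
        cursor = conn.execute(
            "UPDATE drives SET state = 'interrupted' WHERE state = 'running'"
        )
    return cursor.rowcount
```

Running this once before the dashboard starts serving requests ensures no drive is shown as running when nothing is polling it.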

---

## Notifications

### Email
Standard SMTP. Fires on: burn-in failure (immediate), burn-in pass (if enabled), daily report (scheduled).

Failure email includes: drive name, serial number, size, failure stage, failure reason, bad block count (if applicable), SMART attribute snapshot, timestamp.

### Webhook
Single HTTP POST to configured URL with JSON body:
```json
{
  "event": "burnin_failed",
  "drive": "sda",
  "serial": "WDZ1A002",
  "size": "12 TB",
  "failure_reason": "SURFACE VALIDATE",
  "bad_blocks": 2,
  "timestamp": "2025-01-15T03:21:04Z"
}
```
Compatible with ntfy.sh, Slack incoming webhooks, Discord webhooks, n8n HTTP trigger nodes.

Both email and webhook fire simultaneously when both are configured and enabled. User controls each independently via Settings toggles.
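
A minimal sketch of building and sending the webhook with only the standard library follows. The field names come from the JSON example above; omitting `failure_reason` and `bad_blocks` when they are not supplied (for example on `burnin_passed`) is an assumption, and the function names are hypothetical.

```python
import json
import urllib.request
from datetime import datetime, timezone
from typing import Optional


def build_payload(event: str, drive: str, serial: str, size: str,
                  failure_reason: Optional[str] = None,
                  bad_blocks: Optional[int] = None,
                  now: Optional[datetime] = None) -> dict:
    """Assemble the JSON body; failure fields are included only if given."""
    now = now or datetime.now(timezone.utc)
    payload = {"event": event, "drive": drive, "serial": serial, "size": size}
    if failure_reason is not None:
        payload["failure_reason"] = failure_reason
    if bad_blocks is not None:
        payload["bad_blocks"] = bad_blocks
    payload["timestamp"] = now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return payload


def build_request(url: str, payload: dict) -> urllib.request.Request:
    """Create the POST request without sending it."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending is then a single call such as `urllib.request.urlopen(build_request(url, payload), timeout=10)`, wrapped in error handling so a failing receiver never blocks the email alert.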

---

## SSH Architecture

The app connects to TrueNAS over SSH from the host running the Docker container. It does not use the TrueNAS web API for drive operations — all smartctl and badblocks commands are issued directly over SSH.

Connection details are configured in Settings (not `.env`). Supports:
- Password authentication.
- SSH key authentication (key pasted or uploaded in Settings UI).
- Custom port.
- Test Connection button validates credentials before saving.

On SSH disconnection mid-test: the test process on TrueNAS may continue running (SSH disconnection does not kill the remote process if launched correctly with nohup or similar). The app marks the drive as `interrupted` in its own state, attempts to reconnect, and resumes polling if the process is still running. If the remote process is gone, the drive stays `interrupted`.
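
One way to launch a command so it survives an SSH disconnect is sketched below. It assumes a POSIX-style shell on the TrueNAS side; the log path and function names are hypothetical, and the resulting string would be passed to whatever SSH client the app uses.

```python
import shlex
from typing import Sequence


def detached_command(argv: Sequence[str], log_path: str) -> str:
    """Build a shell line that runs argv under nohup in the background,
    redirects all output to log_path, and prints the background PID."""
    command = " ".join(shlex.quote(arg) for arg in argv)
    return f"nohup {command} > {shlex.quote(log_path)} 2>&1 < /dev/null & echo $!"


def alive_check_command(pid: int) -> str:
    """Build a shell line whose exit status is 0 only if the PID exists.
    `kill -0` sends no signal; it only tests for the process."""
    return f"kill -0 {int(pid)}"
```

Storing the printed PID alongside the drive's state would let the app check, after reconnecting, whether the remote process is still running before it resumes polling the log file.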

---

## API

A REST API is available at `/api/v1/`. It is documented via OpenAPI at `/openapi.json` and browsable at `/api` in the dashboard. Version displayed: 0.1.0 (API version tracked independently from app version).

Key endpoints:
- `GET /api/v1/drives` — list all drives with current status.
- `GET /api/v1/drives/{drive_id}` — single drive detail.
- `PATCH /api/v1/drives/{drive_id}` — update drive metadata (e.g., location label).
- `POST /api/v1/drives/{drive_id}/smart/start` — start a SMART test.
- `POST /api/v1/drives/{drive_id}/smart/cancel` — cancel a SMART test.
- `POST /api/v1/burnin/start` — start a burn-in job.
- `POST /api/v1/burnin/{job_id}/cancel` — cancel a burn-in job.
- `GET /sse/drives` — Server-Sent Events stream powering the real-time dashboard UI.
- `GET /health` — health check endpoint.
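
For clients consuming `/sse/drives`, a minimal Server-Sent Events parser might look like the sketch below. It follows the standard SSE framing (events are separated by a blank line, payload lines start with `data:`); the shape of the payload itself is not defined in this spec, so the function simply yields the raw data string of each event. The function name is hypothetical.

```python
from typing import Iterable, Iterator, List


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each SSE event from an iterable of lines.

    Multiple data: lines in one event are joined with newlines. Other
    fields (event:, id:, retry:) and comment lines are ignored here.
    """
    buffer: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # Blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith("data:"):
            value = line[5:]
            # A single leading space after the colon is not part of the value.
            if value.startswith(" "):
                value = value[1:]
            buffer.append(value)
```

The function accepts any iterable of text lines, so it can be fed from a streaming HTTP response decoded line by line.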

The API makes this app a strong candidate for MCP server integration, allowing an AI assistant to query drive status, start tests, or receive alerts conversationally.

---

## Deployment

Docker Compose. Minimum viable setup:

```bash
git clone https://github.com/yourusername/truenas-burnin
cd truenas-burnin
cp .env.example .env
# Edit .env for system-level settings (TrueNAS URL, poll interval, etc.)
docker compose up -d
```

Navigate to `http://your-vm-ip:port` and complete SSH and SMTP configuration in Settings.

All other configuration is done through the Settings UI — no manual file editing required beyond `.env` for system-level values.

---

## mock-truenas

A companion Docker service (`mock-truenas`) that simulates the TrueNAS API for UI development and testing without real hardware. It mocks drive discovery, SMART test responses, and badblocks progress. Used exclusively for development — not deployed in production.

### Testing on Real TrueNAS (v1.0 Milestone Plan)

To validate against real hardware:

1. Switch `TRUENAS_URL` in `.env` from `http://mock-truenas:8000` to your real TrueNAS IP/hostname.
2. Ensure SSH is enabled on TrueNAS (System → Services → SSH).
3. Configure SSH credentials in Settings and use Test Connection to verify.
4. Start with a single idle drive — run Short SMART only first.
5. Verify the log drawer shows real smartctl output.
6. If successful, proceed to Long SMART, then a full burn-in on a drive you're comfortable wiping.
7. Confirm an alert email is received on completion.
8. Scale to 2–4 drives simultaneously and monitor system resource warnings.

**v1.0 is considered production-ready when:** the app runs reliably on a real TrueNAS system with 10 simultaneous drives, a failure alert email is received correctly, and a passing drive's history is preserved across a container restart.

---

## Version

- App version starts at **0.5.0**
- Displayed on the dashboard landing page header and in Settings.
- Update check in Settings queries GitHub releases API.
- API version tracked separately, currently **0.1.0**.

---

## Out of Scope (v1.0)

- Scheduled or automated burn-in triggering.
- Non-destructive badblocks mode (read-only surface scan).
- Multi-TrueNAS support (single host only).
- User authentication / login wall (single-user, self-hosted, IP allowlist is sufficient).
- Mobile-optimized UI (desktop dashboard only).