
# NAS Burn-In — Project Specification

**Version:** 1.0.0-39
**Status:** Active Development
**Audience:** Public / Open Source

---

## Overview

NAS Burn-In is a self-hosted web dashboard that runs on a separate machine or VM and connects to a TrueNAS system via SSH to automate and monitor the drive burn-in process. It is designed for users who want to validate new hard drives before adding them to a ZFS pool — where reliability is non-negotiable.

The app is not a TrueNAS plugin and does not run on TrueNAS itself. It connects remotely over SSH to issue smartctl and badblocks commands, polls results, and presents everything through a dark-themed real-time dashboard. It is deployed via Docker Compose and configured through a Settings UI and `.env` file.

---

## Core Philosophy

- Drives going into a ZFS pool must be rock solid. The app's job is to surface any doubt about a drive before it earns a slot in the pool.
- Burn-in is always triggered manually. There is no scheduling or automation.
- Simplicity over features. The README and Settings UI should be sufficient for any technically capable user to be up and running without hand-holding.
- Recommend safe defaults. Warn loudly when users push limits (too many parallel jobs, destructive operations, high temperatures).

---

## Test Sequence

Every drive goes through the following stages in order. A failure at any stage stops testing on that drive immediately, and its remaining stages do not run.

### Stage 1 — Short SMART Test
```
smartctl -t short -d sat /dev/sdX
```
Polls for completion via:
```
smartctl -a -d sat /dev/sdX | grep -i remaining
```
Expected duration: ~2 minutes. If the test fails or reports any critical attribute violation, the drive is marked FAILED and no further tests run.
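
The polling step can be sketched as a small parser. This is a minimal sketch, not the app's actual code: it assumes the in-progress line reads like `90% of test remaining.`, and `parse_percent_remaining` and `percent_complete` are hypothetical helper names.

```python
import re
from typing import Optional

# Matches the progress line smartctl prints while a self-test is running,
# e.g. "90% of test remaining." (assumed wording; adjust if your output differs).
_REMAINING_RE = re.compile(r"(\d{1,3})%\s+of\s+test\s+remaining", re.IGNORECASE)


def parse_percent_remaining(smartctl_output: str) -> Optional[int]:
    """Return the percent of the self-test still remaining, or None if absent."""
    match = _REMAINING_RE.search(smartctl_output)
    return int(match.group(1)) if match else None


def percent_complete(smartctl_output: str) -> Optional[int]:
    """Convert 'remaining' into a progress-bar percentage (hypothetical helper)."""
    remaining = parse_percent_remaining(smartctl_output)
    return None if remaining is None else 100 - remaining
```

A missing "remaining" line only means no test is in progress. The caller still has to read the self-test log to tell a pass from a failure.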

### Stage 2 — Long SMART Test
```
smartctl -t long -d sat /dev/sdX
```
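Expected duration: varies by drive size (typically 10–20 hours for 8–12TB drives, since the test reads the entire surface; `smartctl` prints the drive's own estimate when the test starts). Same polling approach. Same failure behavior.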

### Stage 3 — Surface Scan (Badblocks, Destructive)
```
badblocks -wsv -b 4096 -p 1 /dev/sdX
```
This is a **destructive write test**. The UI must display a prominent warning before this stage begins, and again in the Settings page where the behavior is documented. The `-w` flag overwrites all data on the drive. This is intentional — these are new drives being validated before pool use.

**Failure threshold:** By default, any bad block triggers an immediate abort and FAILED status. The threshold is configurable in Settings (`Bad Block Threshold`, default: 0, meaning any bad sector = fail).
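
The threshold rule can be illustrated with a short sketch. It assumes `badblocks -v` ends each pass with a summary line such as `Pass completed, 2 bad blocks found. (2/0/0 errors)`, and it reads "threshold" as the highest count that still passes. Both are assumptions, and the function names are hypothetical.

```python
import re
from typing import Optional

# Summary line badblocks prints at the end of a pass when run with -v,
# e.g. "Pass completed, 2 bad blocks found. (2/0/0 errors)" (assumed format).
_SUMMARY_RE = re.compile(r"Pass completed, (\d+) bad blocks? found")


def parse_bad_block_count(log_text: str) -> Optional[int]:
    """Return the bad block count from the last summary line, or None if not found."""
    matches = _SUMMARY_RE.findall(log_text)
    return int(matches[-1]) if matches else None


def surface_scan_failed(bad_blocks: int, threshold: int = 0) -> bool:
    """A drive fails when its bad block count exceeds the configured threshold."""
    return bad_blocks > threshold
```

With the default threshold of 0, a single bad block is enough to fail the drive.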

---

## SMART Attributes to Monitor

The following attributes are checked after each SMART test and continuously during the burn-in run. Raw values are compared against the thresholds below: a WARN threshold flags the drive for review, and a FAIL threshold marks the drive FAILED.

| ID  | Attribute                  | Threshold    | Notes                                      |
|-----|----------------------------|--------------|--------------------------------------------|
| 5   | Reallocated_Sector_Ct      | > 0 = FAIL   | Any reallocation is disqualifying for ZFS  |
| 10  | Spin_Retry_Count           | > 0 = WARN   | Mechanical stress indicator                |
| 188 | Command_Timeout            | > 0 = WARN   | Drive not responding to commands           |
| 197 | Current_Pending_Sector     | > 0 = FAIL   | Sectors waiting to be reallocated          |
| 198 | Offline_Uncorrectable      | > 0 = FAIL   | Unrecoverable read errors                  |
| 199 | UDMA_CRC_Error_Count       | > 0 = WARN   | Likely cable/controller, flag for review   |
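
The table above can be expressed as a lookup plus one evaluation pass. This is a sketch under assumptions: `MONITORED` and `evaluate_attributes` are hypothetical names, the input is taken to be a mapping of attribute ID to raw value, and an attribute the drive does not report is treated as zero.

```python
from typing import Dict, List, Tuple

# Attribute ID -> (name, severity when the raw value is above zero).
MONITORED: Dict[int, Tuple[str, str]] = {
    5: ("Reallocated_Sector_Ct", "FAIL"),
    10: ("Spin_Retry_Count", "WARN"),
    188: ("Command_Timeout", "WARN"),
    197: ("Current_Pending_Sector", "FAIL"),
    198: ("Offline_Uncorrectable", "FAIL"),
    199: ("UDMA_CRC_Error_Count", "WARN"),
}


def evaluate_attributes(raw_values: Dict[int, int]) -> Tuple[str, List[str]]:
    """Return (verdict, reasons); verdict is 'FAIL', 'WARN', or 'OK'."""
    verdict = "OK"
    reasons: List[str] = []
    for attr_id, (name, severity) in MONITORED.items():
        value = raw_values.get(attr_id, 0)  # attributes the drive omits count as 0
        if value > 0:
            reasons.append(f"{severity}: {name} = {value}")
            # FAIL always wins; WARN only upgrades an OK verdict.
            if severity == "FAIL" or verdict == "OK":
                verdict = severity
    return verdict, reasons
```

Returning every reason, not just the first, lets a failure alert carry the full attribute snapshot.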

---

## Failure Behavior

When a drive fails at any stage:

1. All remaining tests for that drive are immediately cancelled.
2. The drive is marked `FAILED` in the UI with the specific failure reason (e.g., `FAILED (SURFACE VALIDATE)`, `FAILED (REALLOCATED SECTORS)`).
3. An alert is fired immediately via whichever notification channels are enabled in Settings (email and/or webhook — both can fire simultaneously).
4. The failed drive's row is visually distinct in the dashboard and cannot be accidentally re-queued without an explicit reset action.

A **Reset** action clears the test state for a drive so it can be re-queued. It does not cancel in-progress tests — the Cancel button does that. Reset is only available on completed drives (passed, failed, or interrupted).
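
The Reset rule amounts to a guard on the drive's state. A minimal sketch follows; the state names mirror the statuses used in this document, but the `DriveState` enum and both functions are hypothetical.

```python
from enum import Enum


class DriveState(str, Enum):
    IDLE = "idle"
    RUNNING = "running"
    PASSED = "passed"
    FAILED = "failed"
    INTERRUPTED = "interrupted"


# Reset applies only to finished runs; a running test must be cancelled instead.
RESETTABLE = {DriveState.PASSED, DriveState.FAILED, DriveState.INTERRUPTED}


def can_reset(state: DriveState) -> bool:
    return state in RESETTABLE


def reset(state: DriveState) -> DriveState:
    """Clear test state so the drive can be re-queued; raises if not allowed."""
    if not can_reset(state):
        raise ValueError(f"cannot reset a drive in state {state.value!r}")
    return DriveState.IDLE
```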

---

## UI

### Dashboard (Main View)

- **Stats bar:** Total drives, Running, Failed, Passed, Idle counts. When SSH is active, also shows CPU and PCH temperature chips (live via SSE) and a thermal pressure indicator (WARM/HOT) that appears when running drives exceed the warning threshold.
- **Filter chips:** All / Running / Failed / Passed / Idle — filters the table below.
- **Drive table columns:** Drive (device name + model), Serial, Size, Temp, Health, Short SMART, Long SMART, Burn-In, Actions.
- **Temperature display:** Color-coded. Green ≤ 45°C, Yellow 46–54°C, Red ≥ 55°C. Thresholds configurable in Settings.
- **Running tests:** Show an animated progress bar with percentage and elapsed time instead of a static badge.
- **Actions per drive:** Short, Long, Burn-In buttons. Cancel button replaces Start when a test is running.
- **Row click:** Opens the Log Drawer for that drive.
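
The temperature color bands from the list above reduce to two comparisons. This sketch uses the default thresholds from Settings; `temperature_color` is a hypothetical name.

```python
def temperature_color(temp_c: float, warn_at: float = 46.0, crit_at: float = 55.0) -> str:
    """Map a drive temperature in degrees Celsius to its dashboard color band."""
    if temp_c >= crit_at:
        return "red"
    if temp_c >= warn_at:
        return "yellow"
    return "green"
```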

### Log Drawer

Slides up from the bottom of the page when a drive row is clicked. Does not navigate away — the table remains visible and scrollable above.

Three tabs:
- **Burn-In** — stage-by-stage progress for the latest burn-in job; shows live elapsed time, raw SSH log output (smartctl / badblocks), and bad block count.
- **SMART** — output of the last smartctl run for this drive, with monitored attribute values highlighted (green/yellow/red). Raw `smartctl -a` output also shown when SSH mode is active.
- **Events** — chronological timeline of everything that happened to this drive (test started, test passed, failure detected, alert sent, reset, etc.).

Features:
- Auto-scroll toggle (on by default).
- Blinking cursor on the active output line of a running test.
- Close button or click the same row again to dismiss.
- Failed drives show error lines in red with exact bad block sector numbers.

### History Page

Per-drive history. Each drive (identified by serial number) has a log of every burn-in run ever performed, with timestamps, results, and duration. Not per-session — per individual drive.

### Audit Page

Application-level event log. Records: test started, test cancelled, settings changed, alert sent, container restarted, SSH connection lost/restored. Useful for debugging and for open-source users troubleshooting their setup.

### Stats Page

Aggregate statistics across all drives and all time. Pass rate, average test duration by drive size, failure breakdown by failure type.

### Settings Page

Divided into sections:

**EMAIL (SMTP)**
- Host, Mode (STARTTLS/SSL/plain), Port, Timeout, Username, Password, From, To.
- Test Connection button.
- Enable/disable toggle.

**WEBHOOK**
- Single URL field. POST JSON payload on `burnin_passed` and `burnin_failed` events.
- Compatible with ntfy.sh, Slack, Discord, n8n, and any generic HTTP POST receiver.
- Leave blank to disable.

**NOTIFICATIONS**
- Daily Report toggle (sends full drive status email at a configured hour).
- Alert on Failure toggle (immediate — fires both email and webhook if both configured).
- Alert on Pass toggle.

**BURN-IN BEHAVIOR**
- Max Parallel Burn-Ins (default: 2, max: 60).
- Warning displayed inline when set above 8: "Running many simultaneous surface scans may saturate your storage controller and produce unreliable results. Recommended: 2–4."
- Bad block failure threshold (default: 0 — any bad sector = fail).
- Stuck job threshold in hours (default: 24 — jobs running longer than this are auto-marked Unknown).
- **Adaptive thermal gate:** When drive temperatures are at or above the warning threshold, new burn-in jobs wait up to 3 minutes before acquiring a semaphore slot. This reduces thermal pile-up when drives are already running hot.
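
One possible shape for the adaptive thermal gate is a bounded wait in front of an `asyncio.Semaphore`. This is a sketch, not the app's implementation: `acquire_burnin_slot`, the `hottest_drive_temp` callback, and the 5-second recheck interval are assumptions. The 3-minute cap and the warning threshold come from this document.

```python
import asyncio
from typing import Callable


async def acquire_burnin_slot(
    slots: asyncio.Semaphore,
    hottest_drive_temp: Callable[[], float],
    warn_at: float = 46.0,
    max_wait_s: float = 180.0,
    recheck_s: float = 5.0,
) -> float:
    """Wait out the thermal gate (bounded), then take a slot. Returns seconds waited."""
    loop = asyncio.get_running_loop()
    started = loop.time()
    # Hold back while drives are at or above the warning threshold,
    # but never longer than max_wait_s so queued jobs cannot starve.
    while hottest_drive_temp() >= warn_at:
        remaining = max_wait_s - (loop.time() - started)
        if remaining <= 0:
            break
        await asyncio.sleep(min(recheck_s, remaining))
    waited = loop.time() - started
    await slots.acquire()  # caller must call slots.release() when the job ends
    return waited
```

The semaphore would be created once with the Max Parallel Burn-Ins value. Because the gate runs before `acquire()`, a waiting job does not occupy a slot while drives cool.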

**TEMPERATURE**
- Warning threshold (default: 46°C).
- Critical threshold (default: 55°C).

**SSH**
- TrueNAS host/IP.
- Port (default: 22).
- Username.
- Authentication: SSH key (paste or upload) or password.
- Test Connection button.

**SYSTEM** *(restart required to change — set in .env)*
- TrueNAS API URL.
- Verify TLS toggle.
- Poll interval (default: 12s).
- Stale threshold (default: 45s).
- IP allowlist.
- Log level (DEBUG / INFO / WARN / ERROR).

**VERSION & UPDATES**
- Displays current version.
- "Check for Updates" button — queries Forgejo releases API at `git.hellocomputer.xyz` and shows latest version if an update is available.
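
Version strings in the `MAJOR.MINOR.PATCH-BUILD` form used here should be compared numerically, because a plain string comparison ranks `1.0.0-8` above `1.0.0-39`. The sketch below assumes the suffix is a build counter and that release tags may carry a leading `v`. Both are assumptions, and the function names are hypothetical.

```python
import re
from typing import Tuple

_VERSION_RE = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)(?:-(\d+))?$")


def parse_version(text: str) -> Tuple[int, int, int, int]:
    """Parse 'MAJOR.MINOR.PATCH-BUILD' into a tuple that compares numerically."""
    match = _VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    major, minor, patch, build = match.groups()
    return int(major), int(minor), int(patch), int(build or 0)


def update_available(current: str, latest: str) -> bool:
    """True when the latest release is strictly newer than the running version."""
    return parse_version(latest) > parse_version(current)
```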

---

## Data Persistence

**SQLite** — single file, zero config, atomic writes. No data loss on container restart.

On restart, any drive that was in a `running` state is automatically transitioned to `interrupted`. The user sees "INTERRUPTED" in the burn-in column and must manually reset and re-queue the drive. The partial log up to the point of interruption is preserved and viewable in the Log Drawer.
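
The restart recovery step can be a single `UPDATE` run at startup. In this sketch the `burnin_jobs` table, its columns, and the second serial number are hypothetical; only the `running` to `interrupted` transition comes from this document.

```python
import sqlite3


def mark_interrupted(conn: sqlite3.Connection) -> int:
    """Flip every job left in 'running' to 'interrupted'; returns rows changed."""
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "UPDATE burnin_jobs SET state = 'interrupted' WHERE state = 'running'"
        )
    return cursor.rowcount


# Demo against an in-memory database with a hypothetical schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE burnin_jobs (serial TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO burnin_jobs VALUES (?, ?)",
    [("WDZ1A002", "running"), ("WDZ1A003", "passed")],
)
changed = mark_interrupted(conn)
states = dict(conn.execute("SELECT serial, state FROM burnin_jobs"))
```

Only the state column changes, so the partial log stored for the job is left untouched.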

Drive location labels persist in SQLite tied to serial number, so a drive's label survives container restarts and reappears automatically when the drive is detected again.

---

## Notifications

### Email
Standard SMTP. Fires on: burn-in failure (immediate), burn-in pass (if enabled), daily report (scheduled).

Failure email includes: drive name, serial number, size, failure stage, failure reason, bad block count (if applicable), SMART attribute snapshot, timestamp.

### Webhook
Single HTTP POST to configured URL with JSON body:
```json
{
  "event": "burnin_failed",
  "drive": "sda",
  "serial": "WDZ1A002",
  "size": "12 TB",
  "failure_reason": "SURFACE VALIDATE",
  "bad_blocks": 2,
  "timestamp": "2025-01-15T03:21:04Z"
}
```
Compatible with ntfy.sh, Slack incoming webhooks, Discord webhooks, n8n HTTP trigger nodes.

Both email and webhook fire simultaneously when both are configured and enabled. User controls each independently via Settings toggles.
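
A sketch of assembling the webhook request with only the standard library is shown below. `build_webhook_request` is a hypothetical helper, and omitting `failure_reason` and `bad_blocks` on pass events is an assumption; the field names follow the JSON example above.

```python
import json
import urllib.request
from datetime import datetime, timezone
from typing import Optional


def build_webhook_request(
    url: str,
    event: str,
    drive: str,
    serial: str,
    size: str,
    failure_reason: Optional[str] = None,
    bad_blocks: Optional[int] = None,
    now: Optional[datetime] = None,
) -> urllib.request.Request:
    """Build (but do not send) the JSON POST for a burn-in event."""
    moment = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    payload = {"event": event, "drive": drive, "serial": serial, "size": size}
    if failure_reason is not None:
        payload["failure_reason"] = failure_reason
    if bad_blocks is not None:
        payload["bad_blocks"] = bad_blocks
    payload["timestamp"] = moment.strftime("%Y-%m-%dT%H:%M:%SZ")
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending is then `urllib.request.urlopen(request, timeout=10)`. Keeping construction separate from sending makes the payload easy to check without a network.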

---

## SSH Architecture

The app connects to TrueNAS over SSH from the host running the Docker container. It does not use the TrueNAS web API for SMART or badblocks operations — all commands are issued directly over SSH using `asyncssh`.

This is required for TrueNAS SCALE 25.10 (Goldeye), which removed the `POST /api/v2.0/smart/test` REST endpoint. SSH is also the only way to run `badblocks`. The TrueNAS REST API is still used for drive discovery (`GET /api/v2.0/disk`) and temperature polling (`POST /api/v2.0/disk/temperatures`).

Connection details are configured in Settings (not `.env`). Supports:
- Password authentication.
- SSH key authentication — key pasted into Settings UI or mounted as a Docker volume at `/run/secrets/ssh_key` (recommended for production).
- Custom port (default: 22).
- Test Connection button validates credentials before saving.

In addition to burn-in commands, the SSH connection is used to:
- Run `sensors -j` (lm-sensors) each poll cycle to read CPU and PCH/chipset temperatures, displayed live in the dashboard stats bar.
- Poll `smartctl -a` progress during standalone SMART tests.
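
The `sensors -j` reading mentioned above could be parsed as in the sketch below. It assumes the usual lm-sensors JSON layout (chip name, then feature, then `*_input` readings) and picks the hottest reading per chip family. `parse_sensors` and `_max_input` are hypothetical names.

```python
import json
from typing import Dict, Optional


def _max_input(chip: dict) -> Optional[float]:
    """Highest '*_input' reading found among a chip's sensor features."""
    readings = [
        value
        for feature in chip.values()
        if isinstance(feature, dict)  # skips the "Adapter" string entry
        for key, value in feature.items()
        if key.endswith("_input") and isinstance(value, (int, float))
    ]
    return max(readings) if readings else None


def parse_sensors(sensors_json: str) -> Dict[str, Optional[float]]:
    """Extract CPU and PCH temperatures (degrees C) from `sensors -j` output."""
    result: Dict[str, Optional[float]] = {"cpu": None, "pch": None}
    for chip_name, chip in json.loads(sensors_json).items():
        if chip_name.startswith("coretemp"):
            slot = "cpu"
        elif chip_name.startswith("pch_"):
            slot = "pch"
        else:
            continue  # ignore other chips such as NVMe or ACPI sensors
        reading = _max_input(chip)
        if reading is not None and (result[slot] is None or reading > result[slot]):
            result[slot] = reading
    return result
```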

On SSH disconnection mid-test: the app marks the drive as `interrupted`. The remote process may or may not still be running. The user must reset the drive and re-queue.

---

## API

A REST API is available at `/api/v1/`. It is documented via OpenAPI at `/openapi.json` and browsable at `/api` in the dashboard. Version displayed: 0.1.0 (API version tracked independently from app version).

Key endpoints:
- `GET /api/v1/drives` — list all drives with current status.
- `GET /api/v1/drives/{drive_id}` — single drive detail.
- `PATCH /api/v1/drives/{drive_id}` — update drive metadata (e.g., location label).
- `POST /api/v1/drives/{drive_id}/smart/start` — start a SMART test.
- `POST /api/v1/drives/{drive_id}/smart/cancel` — cancel a SMART test.
- `POST /api/v1/burnin/start` — start a burn-in job.
- `POST /api/v1/burnin/{job_id}/cancel` — cancel a burn-in job.
- `GET /sse/drives` — Server-Sent Events stream powering the real-time dashboard UI. Also emits `system-sensors` (CPU/PCH temps, thermal pressure) and `job-alert` (browser push notification) events.
- `GET /health` — health check endpoint.
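
A client consuming the `/sse/drives` stream needs to split it into named events. The sketch below parses Server-Sent Events text that has already been read from the connection. The event names come from this document, but the JSON payload fields in the usage example are illustrative assumptions, as is the assumption that `data:` carries JSON.

```python
import json
from typing import Iterator, Tuple


def iter_sse_events(stream_text: str) -> Iterator[Tuple[str, object]]:
    """Yield (event_name, decoded_json_data) for each event in an SSE stream."""
    for block in stream_text.replace("\r\n", "\n").split("\n\n"):
        event_name = "message"  # SSE default when no "event:" field is sent
        data_lines = []
        for line in block.split("\n"):
            if line.startswith(":") or ":" not in line:
                continue  # comment / keep-alive line, or a field with no value
            field, _, value = line.partition(":")
            value = value[1:] if value.startswith(" ") else value
            if field == "event":
                event_name = value
            elif field == "data":
                data_lines.append(value)
        if data_lines:
            yield event_name, json.loads("\n".join(data_lines))
```

For example, feeding it `'event: job-alert\ndata: {"serial": "WDZ1A002"}\n\n'` yields one `job-alert` event with the decoded dictionary.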

The API makes this app a strong candidate for MCP server integration, allowing an AI assistant to query drive status, start tests, or receive alerts conversationally.

---

## Deployment

Docker Compose. Minimum viable setup:

```bash
git clone https://github.com/yourusername/truenas-burnin
cd truenas-burnin
cp .env.example .env
# Edit .env for system-level settings (TrueNAS URL, poll interval, etc.)
docker compose up -d
```

Navigate to `http://your-vm-ip:port` and complete SSH and SMTP configuration in Settings.

All other configuration is done through the Settings UI — no manual file editing required beyond `.env` for system-level values.

---

## TrueNAS Compatibility

Tested and confirmed working against **TrueNAS SCALE 25.10.2.1 (Goldeye)**. Key compatibility notes:

- SCALE 25.10 removed `POST /api/v2.0/smart/test` — SSH is required for all SMART operations.
- Drive temperatures are not included in `GET /api/v2.0/disk` on SCALE — use `POST /api/v2.0/disk/temperatures` instead.
- TrueNAS SCALE is Linux/Debian-based. Device names are `sda`, `sdb`, etc. (not `ada0`/`da0` as on CORE/FreeBSD).
- `lm-sensors` is available on SCALE — `sensors -j` returns CPU (`coretemp`) and PCH (`pch_*`) temperatures.
- `badblocks` and `smartctl` are present at standard paths.

---

## mock-truenas

A companion Docker service (`mock-truenas`) that simulates the TrueNAS API for UI development and testing without real hardware. It mocks drive discovery, SMART test responses, and badblocks progress. Used exclusively for development — not deployed in production. Disabled (commented out) in the production `docker-compose.yml`.

---

## Version

- App version: **1.0.0-39** (displayed in header next to the title, and in Settings).
- Update check in Settings queries Forgejo releases API (`git.hellocomputer.xyz`).
- API version tracked separately, currently **0.1.0**.

---

## Out of Scope (v1.0)

- Scheduled or automated burn-in triggering.
- Non-destructive badblocks mode (read-only surface scan).
- Multi-TrueNAS support (single host only).
- User authentication / login wall (single-user, self-hosted, IP allowlist is sufficient).
- Mobile-optimized UI (desktop dashboard only).