Matches the 1.0.0-38 product display rename. Touches every
infrastructure identifier:
- container_name: truenas-burnin → nas-burnin
- forge URL in /api/v1/updates/check
- security-scan: REPO_URL, REPO, DEPLOY_DIR, systemd unit description
- run-tests.sh default container name
- doc paths in README/SPEC/CLAUDE
- in-app instruction strings (login.html, settings.html, auth_cli.py)
Maple migration done in lockstep:
docker compose down (truenas-burnin)
mv ~/docker/stacks/{truenas-burnin,nas-burnin}
systemd unit ExecStart updated + daemon-reload
docker compose up -d --build → container nas-burnin
Old image truenas-burnin-app removed (~12 GB reclaimed)
Stale top-level orphans cleaned (config.py, poller.py, routes.py,
truenas.py, tests/) — all dead since pre-split refactors
Forge repo rename (git.hellocomputer.xyz/brandon/truenas-burnin →
nas-burnin) is a separate UI-only step. Forgejo redirects the old
URL after rename, so this commit can be pushed to the existing
remote first; remote URL gets updated locally once you rename.
NAS Burn-In — Project Specification
Version: 1.0.0-39
Status: Active Development
Audience: Public / Open Source
Overview
NAS Burn-In is a self-hosted web dashboard that runs on a separate machine or VM and connects to a TrueNAS system via SSH to automate and monitor the drive burn-in process. It is designed for users who want to validate new hard drives before adding them to a ZFS pool — where reliability is non-negotiable.
The app is not a TrueNAS plugin and does not run on TrueNAS itself. It connects remotely over SSH to issue smartctl and badblocks commands, polls results, and presents everything through a dark-themed real-time dashboard. It is deployed via Docker Compose and configured through a Settings UI and .env file.
Core Philosophy
- Drives going into a ZFS pool must be rock solid. The app's job is to surface any doubt about a drive before it earns a slot in the pool.
- Burn-in is always triggered manually. There is no scheduling or automation.
- Simplicity over features. The README and Settings UI should be sufficient for any technically capable user to be up and running without hand-holding.
- Recommend safe defaults. Warn loudly when users push limits (too many parallel jobs, destructive operations, high temperatures).
Test Sequence
Every drive goes through the following stages in order. A failure at any stage stops that drive immediately.
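The stop-on-first-failure rule can be sketched as follows. This is a minimal illustration, not the app's actual runner: `run_stages` and the `Stage` tuple are hypothetical names, and each stage is assumed to be a callable that returns True on pass.

```python
from typing import Callable, List, Tuple

# Hypothetical stage shape: (display name, callable returning True on pass).
Stage = Tuple[str, Callable[[], bool]]


def run_stages(stages: List[Stage]) -> Tuple[str, List[str]]:
    """Run stages in order and stop at the first failure.

    Returns the final status and the names of the stages that actually ran.
    """
    ran: List[str] = []
    for name, run in stages:
        ran.append(name)
        if not run():
            # Later stages are never started for this drive.
            return f"FAILED ({name})", ran
    return "PASSED", ran
```

With a simulated failure in the second stage, the third stage never appears in the list of stages that ran.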
Stage 1 — Short SMART Test
smartctl -t short -d sat /dev/sdX
Polls for completion via:
smartctl -a -d sat /dev/sdX | grep -i remaining
Expected duration: ~2 minutes. If the test fails or reports any critical attribute violation, the drive is marked FAILED and no further tests run.
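A minimal sketch of how the polled output could be turned into a progress value. It assumes smartctl reports an in-progress self-test with a line of the form "NN% of test remaining"; the function names are hypothetical and not taken from the app.

```python
import re
from typing import Optional

# Matches e.g. "90% of test remaining." in smartctl -a output.
_REMAINING = re.compile(r"(\d{1,3})%\s+of\s+test\s+remaining", re.IGNORECASE)


def percent_remaining(smartctl_output: str) -> Optional[int]:
    """Return the remaining percentage, or None if no such line is present."""
    match = _REMAINING.search(smartctl_output)
    return int(match.group(1)) if match else None


def percent_done(smartctl_output: str) -> Optional[int]:
    """Convert 'remaining' into 'done' for a progress bar."""
    remaining = percent_remaining(smartctl_output)
    return None if remaining is None else 100 - remaining
```

A missing "remaining" line does not by itself say whether the test finished or never started, so a caller would still need to inspect the self-test status or log before declaring a result.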
Stage 2 — Long SMART Test
smartctl -t long -d sat /dev/sdX
Expected duration: varies by drive size (typically 3–6 hours for 8–12TB drives). Same polling approach. Same failure behavior.
Stage 3 — Surface Scan (Badblocks, Destructive)
badblocks -wsv -b 4096 -p 1 /dev/sdX
This is a destructive write test. The UI must display a prominent warning before this stage begins, and again in the Settings page where the behavior is documented. The -w flag overwrites all data on the drive. This is intentional — these are new drives being validated before pool use.
Failure threshold: Any bad blocks found triggers immediate abort and FAILED status by default. The threshold is configurable in Settings (Bad Block Threshold, default: 0 — meaning any bad sector = fail).
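The threshold rule might look like the sketch below. Two assumptions are made here: badblocks ends its run with a summary line of the form "Pass completed, N bad blocks found", and a drive fails when the count is strictly greater than the configured threshold (which matches the documented default of 0). The function names are illustrative only.

```python
import re
from typing import Optional

# Assumed summary line printed by badblocks at the end of a pass.
_SUMMARY = re.compile(r"Pass completed, (\d+) bad blocks found")


def bad_block_count(badblocks_output: str) -> Optional[int]:
    """Extract the bad block count, or None if the summary line is absent."""
    match = _SUMMARY.search(badblocks_output)
    return int(match.group(1)) if match else None


def surface_scan_failed(count: int, threshold: int = 0) -> bool:
    """With the default threshold of 0, any bad block fails the drive."""
    return count > threshold
```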
SMART Attributes to Monitor
The following attributes are checked after each SMART test and continuously during the burn-in run. Each has a severity: a non-zero raw value either raises a warning (WARN) or fails the drive (FAIL), as listed below.
| ID | Attribute | Threshold | Notes |
|---|---|---|---|
| 5 | Reallocated_Sector_Ct | > 0 = FAIL | Any reallocation is disqualifying for ZFS |
| 10 | Spin_Retry_Count | > 0 = WARN | Mechanical stress indicator |
| 188 | Command_Timeout | > 0 = WARN | Drive not responding to commands |
| 197 | Current_Pending_Sector | > 0 = FAIL | Sectors waiting to be reallocated |
| 198 | Offline_Uncorrectable | > 0 = FAIL | Unrecoverable read errors |
| 199 | UDMA_CRC_Error_Count | > 0 = WARN | Likely cable/controller, flag for review |
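The table can be expressed as a small rule set. This is a sketch under the assumption that raw attribute values arrive as a dict keyed by attribute ID; `RULES` and `evaluate` are hypothetical names.

```python
from typing import Dict, List, Tuple

# Attribute ID -> (name, severity when the raw value is > 0), per the table above.
RULES: Dict[int, Tuple[str, str]] = {
    5: ("Reallocated_Sector_Ct", "FAIL"),
    10: ("Spin_Retry_Count", "WARN"),
    188: ("Command_Timeout", "WARN"),
    197: ("Current_Pending_Sector", "FAIL"),
    198: ("Offline_Uncorrectable", "FAIL"),
    199: ("UDMA_CRC_Error_Count", "WARN"),
}


def evaluate(raw_values: Dict[int, int]) -> Tuple[str, List[str]]:
    """Return the overall verdict (OK, WARN, or FAIL) and the reasons behind it."""
    verdict = "OK"
    reasons: List[str] = []
    for attr_id, (name, severity) in RULES.items():
        value = raw_values.get(attr_id, 0)
        if value > 0:
            reasons.append(f"{name}={value} ({severity})")
            if severity == "FAIL":
                verdict = "FAIL"
            elif verdict == "OK":
                verdict = "WARN"
    return verdict, reasons
```

Any FAIL attribute dominates: a drive with both a CRC warning and a pending sector is reported as FAIL, with both reasons listed.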
Failure Behavior
When a drive fails at any stage:
- All remaining tests for that drive are immediately cancelled.
- The drive is marked FAILED in the UI with the specific failure reason (e.g., FAILED (SURFACE VALIDATE), FAILED (REALLOCATED SECTORS)).
- An alert is fired immediately via whichever notification channels are enabled in Settings (email and/or webhook; both can fire simultaneously).
- The failed drive's row is visually distinct in the dashboard and cannot be accidentally re-queued without an explicit reset action.
A Reset action clears the test state for a drive so it can be re-queued. It does not cancel in-progress tests — the Cancel button does that. Reset is only available on completed drives (passed, failed, or interrupted).
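The availability rule for Reset reduces to a membership check. The state strings below are assumptions that mirror the wording above; the real identifiers may differ.

```python
# Hypothetical state names for completed drives.
RESETTABLE_STATES = frozenset({"passed", "failed", "interrupted"})


def can_reset(state: str) -> bool:
    """Reset is offered only for completed drives, never for running ones."""
    return state.lower() in RESETTABLE_STATES
```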
UI
Dashboard (Main View)
- Stats bar: Total drives, Running, Failed, Passed, Idle counts. When SSH is active, also shows CPU and PCH temperature chips (live via SSE) and a thermal pressure indicator (WARM/HOT) that appears when running drives exceed the warning threshold.
- Filter chips: All / Running / Failed / Passed / Idle — filters the table below.
- Drive table columns: Drive (device name + model), Serial, Size, Temp, Health, Short SMART, Long SMART, Burn-In, Actions.
- Temperature display: Color-coded. Green ≤ 45°C, Yellow 46–54°C, Red ≥ 55°C. Thresholds configurable in Settings.
- Running tests: Show an animated progress bar with percentage and elapsed time instead of a static badge.
- Actions per drive: Short, Long, Burn-In buttons. Cancel button replaces Start when a test is running.
- Row click: Opens the Log Drawer for that drive.
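The temperature color bands can be sketched as a single function. The parameter names are hypothetical, and the defaults mirror the documented thresholds (warning at 46°C, critical at 55°C).

```python
def temp_color(temp_c: float, warn: float = 46.0, crit: float = 55.0) -> str:
    """Map a drive temperature in degrees Celsius to a display color."""
    if temp_c >= crit:
        return "red"
    if temp_c >= warn:
        return "yellow"
    return "green"
```

This treats readings as whole degrees, as drives typically report them; a fractional reading between 45 and 46 would fall into the green band here.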
Log Drawer
Slides up from the bottom of the page when a drive row is clicked. Does not navigate away — the table remains visible and scrollable above.
Three tabs:
- Burn-In: stage-by-stage progress for the latest burn-in job; shows live elapsed time, raw SSH log output (smartctl / badblocks), and bad block count.
- SMART: output of the last smartctl run for this drive, with monitored attribute values highlighted (green/yellow/red). Raw smartctl -a output is also shown when SSH mode is active.
- Events: chronological timeline of everything that happened to this drive (test started, test passed, failure detected, alert sent, reset, etc.).
Features:
- Auto-scroll toggle (on by default).
- Blinking cursor on the active output line of a running test.
- Close button or click the same row again to dismiss.
- Failed drives show error lines in red with exact bad block sector numbers.
History Page
Per-drive history. Each drive (identified by serial number) has a log of every burn-in run ever performed, with timestamps, results, and duration. Not per-session — per individual drive.
Audit Page
Application-level event log. Records: test started, test cancelled, settings changed, alert sent, container restarted, SSH connection lost/restored. Useful for debugging and for open source users troubleshooting their setup.
Stats Page
Aggregate statistics across all drives and all time. Pass rate, average test duration by drive size, failure breakdown by failure type.
Settings Page
Divided into sections:
EMAIL (SMTP)
- Host, Mode (STARTTLS/SSL/plain), Port, Timeout, Username, Password, From, To.
- Test Connection button.
- Enable/disable toggle.
WEBHOOK
- Single URL field. POST JSON payload on burnin_passed and burnin_failed events.
- Compatible with ntfy.sh, Slack, Discord, n8n, and any generic HTTP POST receiver.
- Leave blank to disable.
NOTIFICATIONS
- Daily Report toggle (sends full drive status email at a configured hour).
- Alert on Failure toggle (immediate — fires both email and webhook if both configured).
- Alert on Pass toggle.
BURN-IN BEHAVIOR
- Max Parallel Burn-Ins (default: 2, max: 60).
- Warning displayed inline when set above 8: "Running many simultaneous surface scans may saturate your storage controller and produce unreliable results. Recommended: 2–4."
- Bad block failure threshold (default: 0 — any bad sector = fail).
- Stuck job threshold in hours (default: 24 — jobs running longer than this are auto-marked Unknown).
- Adaptive thermal gate: When drive temperatures are at or above the warning threshold, new burn-in jobs wait up to 3 minutes before acquiring a semaphore slot. This reduces thermal pile-up when drives are already running hot.
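A minimal sketch of the adaptive thermal gate in front of the parallel-job semaphore, assuming an asyncio-based scheduler. `acquire_slot`, `is_hot`, and the polling interval are hypothetical; only the 3-minute cap comes from the description above.

```python
import asyncio
from typing import Callable


async def acquire_slot(
    slots: asyncio.Semaphore,
    is_hot: Callable[[], bool],
    max_wait_s: float = 180.0,   # "up to 3 minutes"
    check_every_s: float = 5.0,  # assumed re-check interval
) -> float:
    """Hold back while drives are hot (bounded), then take a semaphore slot.

    Returns the number of seconds spent waiting on the thermal gate.
    """
    loop = asyncio.get_running_loop()
    start = loop.time()
    while is_hot() and loop.time() - start < max_wait_s:
        remaining = max_wait_s - (loop.time() - start)
        await asyncio.sleep(min(check_every_s, max(remaining, 0.0)))
    waited = loop.time() - start
    # The gate only delays; the job still queues for a slot afterwards.
    await slots.acquire()
    return waited
```

The gate is deliberately bounded: a job that has waited the full period proceeds even if temperatures are still high, so a warm chassis slows new jobs down without blocking them indefinitely.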
TEMPERATURE
- Warning threshold (default: 46°C).
- Critical threshold (default: 55°C).
SSH
- TrueNAS host/IP.
- Port (default: 22).
- Username.
- Authentication: SSH key (paste or upload) or password.
- Test Connection button.
SYSTEM (restart required to change — set in .env)
- TrueNAS API URL.
- Verify TLS toggle.
- Poll interval (default: 12s).
- Stale threshold (default: 45s).
- IP allowlist.
- Log level (DEBUG / INFO / WARN / ERROR).
VERSION & UPDATES
- Displays current version.
- "Check for Updates" button: queries the Forgejo releases API at git.hellocomputer.xyz and shows the latest version if an update is available.
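The comparison behind an update check could be sketched as below. It assumes release tags follow the same MAJOR.MINOR.PATCH-BUILD scheme as the app version (for example 1.0.0-39) and that the trailing number is a build counter rather than a semver pre-release; both function names are hypothetical.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """'1.0.0-39' or 'v1.0.0-39' -> (1, 0, 0, 39); a missing build counts as 0."""
    core, _, build = text.strip().lstrip("vV").partition("-")
    parts = [int(piece) for piece in core.split(".")]
    parts.append(int(build) if build else 0)
    return tuple(parts)


def update_available(current: str, latest: str) -> bool:
    """True when the latest release is strictly newer than the running version."""
    return parse_version(latest) > parse_version(current)
```

Comparing integer tuples avoids the string-ordering trap where build 10 would sort before build 9.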
Data Persistence
SQLite — single file, zero config, atomic writes. No data loss on container restart.
On restart, any drive that was in a running state is automatically transitioned to interrupted. The user sees "INTERRUPTED" in the burn-in column and must manually reset and re-queue the drive. The partial log up to the point of interruption is preserved and viewable in the Log Drawer.
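The restart transition can be sketched as one UPDATE run at startup. The table name `burnin_jobs` and the `state` column are hypothetical, since the actual schema is not described here.

```python
import sqlite3


def mark_interrupted(conn: sqlite3.Connection) -> int:
    """Flip every job left in 'running' to 'interrupted'; return the row count."""
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "UPDATE burnin_jobs SET state = 'interrupted' WHERE state = 'running'"
        )
    return cursor.rowcount
```

Running this once before the scheduler starts means no job can appear as running unless a live worker is actually attached to it.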
Drive location labels persist in SQLite tied to serial number, so a drive's label survives container restarts and reappears automatically when the drive is detected again.
Notifications
Email
Standard SMTP. Fires on: burn-in failure (immediate), burn-in pass (if enabled), daily report (scheduled).
Failure email includes: drive name, serial number, size, failure stage, failure reason, bad block count (if applicable), SMART attribute snapshot, timestamp.
Webhook
Single HTTP POST to configured URL with JSON body:
{
"event": "burnin_failed",
"drive": "sda",
"serial": "WDZ1A002",
"size": "12 TB",
"failure_reason": "SURFACE VALIDATE",
"bad_blocks": 2,
"timestamp": "2025-01-15T03:21:04Z"
}
Compatible with ntfy.sh, Slack incoming webhooks, Discord webhooks, n8n HTTP trigger nodes.
Both email and webhook fire simultaneously when both are configured and enabled. User controls each independently via Settings toggles.
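A sketch of building and posting the webhook body with the standard library. The field names follow the JSON example above; omitting failure_reason and bad_blocks on pass events is an assumption, and `build_payload` and `build_request` are hypothetical helpers.

```python
import json
import urllib.request
from datetime import datetime, timezone
from typing import Optional


def build_payload(event: str, drive: str, serial: str, size: str,
                  failure_reason: Optional[str] = None,
                  bad_blocks: Optional[int] = None,
                  now: Optional[datetime] = None) -> dict:
    """Assemble the JSON body; `now` must be timezone-aware if given."""
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    payload = {"event": event, "drive": drive, "serial": serial, "size": size}
    if failure_reason is not None:
        payload["failure_reason"] = failure_reason
    if bad_blocks is not None:
        payload["bad_blocks"] = bad_blocks
    payload["timestamp"] = now.strftime("%Y-%m-%dT%H:%M:%SZ")
    return payload


def build_request(url: str, payload: dict) -> urllib.request.Request:
    """Prepare a single HTTP POST carrying the payload as JSON."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending is then a matter of passing the request to `urllib.request.urlopen` with a timeout, wrapped in error handling so that a failing webhook never blocks the email channel.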
SSH Architecture
The app connects to TrueNAS over SSH from the host running the Docker container. It does not use the TrueNAS web API for SMART or badblocks operations — all commands are issued directly over SSH using asyncssh.
This is required for TrueNAS SCALE 25.10 (Goldeye), which removed the POST /api/v2.0/smart/test REST endpoint. SSH is also the only way to run badblocks. The TrueNAS REST API is still used for drive discovery (GET /api/v2.0/disk) and temperature polling (POST /api/v2.0/disk/temperatures).
Connection details are configured in Settings (not .env). Supports:
- Password authentication.
- SSH key authentication: key pasted into the Settings UI or mounted as a Docker volume at /run/secrets/ssh_key (recommended for production).
- Custom port (default: 22).
- Test Connection button validates credentials before saving.
In addition to burn-in commands, the SSH connection is used to:
- Run sensors -j (lm-sensors) each poll cycle to read CPU and PCH/chipset temperatures, displayed live in the dashboard stats bar.
- Poll smartctl -a progress during standalone SMART tests.
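A minimal sketch of reducing sensors -j output to the two values shown in the stats bar. It assumes the usual lm-sensors JSON layout (chip, then feature, then keys such as temp1_input) and takes the hottest reading per chip family; the sample data and function names are illustrative.

```python
import json
from typing import Dict, Optional

# Illustrative sample shaped like `sensors -j` output; values are made up.
SAMPLE = json.dumps({
    "coretemp-isa-0000": {
        "Adapter": "ISA adapter",
        "Package id 0": {"temp1_input": 48.0, "temp1_max": 80.0},
        "Core 0": {"temp2_input": 45.0, "temp2_max": 80.0},
    },
    "pch_cannonlake-virtual-0": {
        "Adapter": "Virtual device",
        "temp1": {"temp1_input": 52.0},
    },
})


def max_temp(chip: dict) -> Optional[float]:
    """Hottest temp*_input reading on one chip, or None if it has none."""
    readings = [
        value
        for feature in chip.values() if isinstance(feature, dict)
        for key, value in feature.items()
        if key.startswith("temp") and key.endswith("_input")
    ]
    return max(readings) if readings else None


def cpu_and_pch(sensors_json: str) -> Dict[str, Optional[float]]:
    """Pick CPU (coretemp*) and PCH (pch_*) temperatures from sensors -j output."""
    result: Dict[str, Optional[float]] = {"cpu": None, "pch": None}
    for chip_name, chip in json.loads(sensors_json).items():
        if chip_name.startswith("coretemp"):
            kind = "cpu"
        elif chip_name.startswith("pch_"):
            kind = "pch"
        else:
            continue
        temp = max_temp(chip)
        if temp is not None and (result[kind] is None or temp > result[kind]):
            result[kind] = temp
    return result
```

Hosts without a coretemp or pch_* chip (AMD systems, some VMs) would yield None for that entry, which a UI can render by hiding the chip.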
On SSH disconnection mid-test: the app marks the drive as interrupted. The remote process may or may not still be running. The user must reset the drive and re-queue.
API
A REST API is available at /api/v1/. It is documented via OpenAPI at /openapi.json and browsable at /api in the dashboard. Version displayed: 0.1.0 (API version tracked independently from app version).
Key endpoints:
- GET /api/v1/drives: list all drives with current status.
- GET /api/v1/drives/{drive_id}: single drive detail.
- PATCH /api/v1/drives/{drive_id}: update drive metadata (e.g., location label).
- POST /api/v1/drives/{drive_id}/smart/start: start a SMART test.
- POST /api/v1/drives/{drive_id}/smart/cancel: cancel a SMART test.
- POST /api/v1/burnin/start: start a burn-in job.
- POST /api/v1/burnin/{job_id}/cancel: cancel a burn-in job.
- GET /sse/drives: Server-Sent Events stream powering the real-time dashboard UI. Also emits system-sensors (CPU/PCH temps, thermal pressure) and job-alert (browser push notification) events.
- GET /health: health check endpoint.
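A small client sketch for two of the endpoints, using only the standard library. The paths come from the list above; the class name, the base URL, and the assumption that responses are JSON and that cancel needs no request body are illustrative and should be checked against /openapi.json.

```python
import json
import urllib.request


class BurnInClient:
    """Hypothetical thin wrapper around the /api/v1/ endpoints."""

    def __init__(self, base_url: str) -> None:
        self.base_url = base_url.rstrip("/")

    def _request(self, method: str, path: str) -> urllib.request.Request:
        return urllib.request.Request(self.base_url + path, method=method)

    def list_drives_request(self) -> urllib.request.Request:
        return self._request("GET", "/api/v1/drives")

    def cancel_burnin_request(self, job_id: int) -> urllib.request.Request:
        return self._request("POST", f"/api/v1/burnin/{job_id}/cancel")

    def send(self, request: urllib.request.Request, timeout: float = 10.0):
        """Perform the call and decode the body, assuming a JSON response."""
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
```

Building the request separately from sending it keeps the URL logic testable without a running instance, for example `client.send(client.list_drives_request())` against a local deployment.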
The API makes this app a strong candidate for MCP server integration, allowing an AI assistant to query drive status, start tests, or receive alerts conversationally.
Deployment
Docker Compose. Minimum viable setup:
git clone https://github.com/yourusername/nas-burnin
cd nas-burnin
cp .env.example .env
# Edit .env for system-level settings (TrueNAS URL, poll interval, etc.)
docker compose up -d
Navigate to http://your-vm-ip:port and complete SSH and SMTP configuration in Settings.
All other configuration is done through the Settings UI — no manual file editing required beyond .env for system-level values.
TrueNAS Compatibility
Tested and confirmed working against TrueNAS SCALE 25.10.2.1 (Goldeye). Key compatibility notes:
- SCALE 25.10 removed POST /api/v2.0/smart/test, so SSH is required for all SMART operations.
- Drive temperatures are not included in GET /api/v2.0/disk on SCALE; use POST /api/v2.0/disk/temperatures instead.
- TrueNAS SCALE is Linux/Debian-based. Device names are sda, sdb, etc. (not ada0/da0 as on CORE/FreeBSD).
- lm-sensors is available on SCALE; sensors -j returns CPU (coretemp) and PCH (pch_*) temperatures.
- badblocks and smartctl are present at standard paths.
mock-truenas
A companion Docker service (mock-truenas) that simulates the TrueNAS API for UI development and testing without real hardware. It mocks drive discovery, SMART test responses, and badblocks progress. Used exclusively for development — not deployed in production. Disabled (commented out) in the production docker-compose.yml.
Version
- App version: 1.0.0-39 (displayed in header next to the title, and in Settings).
- Update check in Settings queries the Forgejo releases API (git.hellocomputer.xyz).
- API version tracked separately, currently 0.1.0.
Out of Scope (v1.0)
- Scheduled or automated burn-in triggering.
- Non-destructive badblocks mode (read-only surface scan).
- Multi-TrueNAS support (single host only).
- User authentication / login wall (single-user, self-hosted, IP allowlist is sufficient).
- Mobile-optimized UI (desktop dashboard only).