brandon/nas-burnin

Brandon Walter 8ae84862de

infra: rename truenas-burnin → nas-burnin (1.0.0-41)

Matches the 1.0.0-38 product display rename. Touches every
infrastructure identifier:

- container_name: truenas-burnin → nas-burnin
- forge URL in /api/v1/updates/check
- security-scan: REPO_URL, REPO, DEPLOY_DIR, systemd unit description
- run-tests.sh default container name
- doc paths in README/SPEC/CLAUDE
- in-app instruction strings (login.html, settings.html, auth_cli.py)

Maple migration done in lockstep:
  docker compose down (truenas-burnin)
  mv ~/docker/stacks/{truenas-burnin,nas-burnin}
  systemd unit ExecStart updated + daemon-reload
  docker compose up -d --build → container nas-burnin
  Old image truenas-burnin-app removed (~12 GB reclaimed)
  Stale top-level orphans cleaned (config.py, poller.py, routes.py,
  truenas.py, tests/) — all dead since pre-split refactors

Forge repo rename (git.hellocomputer.xyz/brandon/truenas-burnin →
nas-burnin) is a separate UI-only step. Forgejo redirects the old
URL after rename, so this commit can be pushed to the existing
remote first; remote URL gets updated locally once you rename.

2026-05-04 07:16:02 -07:00

NAS Burn-In — Project Specification

Version: 1.0.0-39
Status: Active Development
Audience: Public / Open Source

Overview

NAS Burn-In is a self-hosted web dashboard that runs on a separate machine or VM and connects to a TrueNAS system via SSH to automate and monitor the drive burn-in process. It is designed for users who want to validate new hard drives before adding them to a ZFS pool — where reliability is non-negotiable.

The app is not a TrueNAS plugin and does not run on TrueNAS itself. It connects remotely over SSH to issue smartctl and badblocks commands, polls results, and presents everything through a dark-themed real-time dashboard. It is deployed via Docker Compose and configured through a Settings UI and .env file.

Core Philosophy

Drives going into a ZFS pool must be rock solid. The app's job is to surface any doubt about a drive before it earns a slot in the pool.
Burn-in is always triggered manually. There is no scheduling or automation.
Simplicity over features. The README and Settings UI should be sufficient for any technically capable user to be up and running without hand-holding.
Recommend safe defaults. Warn loudly when users push limits (too many parallel jobs, destructive operations, high temperatures).

Test Sequence

Every drive goes through the following stages in order. A failure at any stage stops that drive immediately.
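The stop-on-first-failure rule can be sketched as a simple ordered loop. This is a minimal illustration, not the app's actual job runner: `run_burnin` and the per-stage callables are hypothetical names, and the real stages would issue commands over SSH rather than call local functions.

```python
from typing import Callable, List, Optional, Tuple

# (stage name, runner) pairs. The runners are hypothetical stand-ins for the
# SSH-backed tests; each returns True on pass and False on failure.
Stage = Tuple[str, Callable[[str], bool]]


def run_burnin(device: str, stages: List[Stage]) -> Tuple[str, Optional[str]]:
    """Run stages in order and stop at the first failure.

    Returns ("PASSED", None) or ("FAILED", <name of the failing stage>).
    """
    for name, runner in stages:
        if not runner(device):
            # Later stages are never started for this drive.
            return "FAILED", name
    return "PASSED", None
```

Passing the stages as data keeps the ordering explicit and makes the early exit easy to test with stub runners.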

Stage 1 — Short SMART Test

smartctl -t short -d sat /dev/sdX

Polls for completion via:

smartctl -a -d sat /dev/sdX | grep -i remaining

Expected duration: ~2 minutes. If the test fails or reports any critical attribute violation, the drive is marked FAILED and no further tests run.
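As a rough sketch of the polling step, the line matched by `grep -i remaining` can be turned into a completion percentage. The sample wording ("90% of test remaining.") is what smartmontools typically prints for ATA drives and is an assumption here; `percent_complete` is a hypothetical helper name.

```python
import re
from typing import Optional

# Matches the "NN% of test remaining" fragment in smartctl -a output.
_REMAINING = re.compile(r"(\d{1,3})%\s+of\s+test\s+remaining", re.IGNORECASE)


def percent_complete(smartctl_output: str) -> Optional[int]:
    """Return self-test progress (0-100), or None if no test is in progress."""
    match = _REMAINING.search(smartctl_output)
    if match is None:
        return None
    remaining = int(match.group(1))
    # Clamp in case the drive reports an out-of-range value.
    return max(0, min(100, 100 - remaining))
```

A `None` result only means the "remaining" line is absent; the caller still has to inspect the self-test log to tell a pass from a failure.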

Stage 2 — Long SMART Test

smartctl -t long -d sat /dev/sdX

Expected duration: varies by drive size (typically 3–6 hours for 8–12TB drives). Same polling approach. Same failure behavior.

Stage 3 — Surface Scan (Badblocks, Destructive)

badblocks -wsv -b 4096 -p 1 /dev/sdX

This is a destructive write test. The UI must display a prominent warning before this stage begins, and again in the Settings page where the behavior is documented. The -w flag overwrites all data on the drive. This is intentional — these are new drives being validated before pool use.

Failure threshold: Any bad blocks found triggers immediate abort and FAILED status by default. The threshold is configurable in Settings (Bad Block Threshold, default: 0 — meaning any bad sector = fail).
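A minimal sketch of how progress lines and the threshold might be handled, assuming the progress format that `badblocks -s` usually prints (for example `12.50% done, 1:02:03 elapsed. (0/1/2 errors)`). The helper names are hypothetical, and treating the threshold as "fail when the count exceeds it" is an assumption consistent with the default of 0.

```python
import re
from typing import Optional, Tuple

# Percentage plus the read/write/corruption error counters from a -s progress line.
_PROGRESS = re.compile(r"(\d+(?:\.\d+)?)% done, .*?\((\d+)/(\d+)/(\d+) errors\)")


def parse_progress(line: str) -> Optional[Tuple[float, int]]:
    """Return (percent_done, total_errors), or None for non-progress lines."""
    match = _PROGRESS.search(line)
    if match is None:
        return None
    percent = float(match.group(1))
    errors = sum(int(match.group(i)) for i in (2, 3, 4))
    return percent, errors


def exceeds_threshold(bad_blocks: int, threshold: int = 0) -> bool:
    """With the default threshold of 0, a single bad block fails the drive."""
    return bad_blocks > threshold
```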

SMART Attributes to Monitor

The following attributes are checked after each SMART test and continuously during the burn-in run. Each attribute's raw value is compared against the threshold listed below: depending on the attribute, any non-zero value either raises a warning (WARN) or fails the drive (FAIL).

ID	Attribute	Threshold	Notes
5	Reallocated_Sector_Ct	> 0 = FAIL	Any reallocation is disqualifying for ZFS
10	Spin_Retry_Count	> 0 = WARN	Mechanical stress indicator
188	Command_Timeout	> 0 = WARN	Drive not responding to commands
197	Current_Pending_Sector	> 0 = FAIL	Sectors waiting to be reallocated
198	Offline_Uncorrectable	> 0 = FAIL	Unrecoverable read errors
199	UDMA_CRC_Error_Count	> 0 = WARN	Likely cable/controller, flag for review
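The table above maps naturally onto a lookup keyed by attribute ID. The following is a sketch under the assumption that raw values have already been parsed from `smartctl -a`; `RULES` and `evaluate` are illustrative names, not the project's own.

```python
from typing import Dict, List, Tuple

# Attribute ID -> (name, severity when the raw value is non-zero), per the table.
RULES: Dict[int, Tuple[str, str]] = {
    5: ("Reallocated_Sector_Ct", "FAIL"),
    10: ("Spin_Retry_Count", "WARN"),
    188: ("Command_Timeout", "WARN"),
    197: ("Current_Pending_Sector", "FAIL"),
    198: ("Offline_Uncorrectable", "FAIL"),
    199: ("UDMA_CRC_Error_Count", "WARN"),
}


def evaluate(raw_values: Dict[int, int]) -> Tuple[str, List[str]]:
    """Return the worst verdict ("OK", "WARN" or "FAIL") and the reasons for it."""
    verdict = "OK"
    reasons: List[str] = []
    for attr_id, (name, severity) in RULES.items():
        value = raw_values.get(attr_id, 0)
        if value > 0:
            reasons.append(f"{name}={value} ({severity})")
            # FAIL always wins; WARN only upgrades an OK verdict.
            if severity == "FAIL" or verdict == "OK":
                verdict = severity
    return verdict, reasons
```

Attributes missing from the input are treated as zero here, which is a simplification; a real check may prefer to flag a drive that does not report an expected attribute.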

Failure Behavior

When a drive fails at any stage:

All remaining tests for that drive are immediately cancelled.
The drive is marked FAILED in the UI with the specific failure reason (e.g., FAILED (SURFACE VALIDATE), FAILED (REALLOCATED SECTORS)).
An alert is fired immediately via whichever notification channels are enabled in Settings (email and/or webhook — both can fire simultaneously).
The failed drive's row is visually distinct in the dashboard and cannot be accidentally re-queued without an explicit reset action.

A Reset action clears the test state for a drive so it can be re-queued. It does not cancel in-progress tests — the Cancel button does that. Reset is only available on completed drives (passed, failed, or interrupted).

UI

Dashboard (Main View)

Stats bar: Total drives, Running, Failed, Passed, Idle counts. When SSH is active, also shows CPU and PCH temperature chips (live via SSE) and a thermal pressure indicator (WARM/HOT) that appears when running drives exceed the warning threshold.
Filter chips: All / Running / Failed / Passed / Idle — filters the table below.
Drive table columns: Drive (device name + model), Serial, Size, Temp, Health, Short SMART, Long SMART, Burn-In, Actions.
Temperature display: Color-coded. Green ≤ 45°C, Yellow 46–54°C, Red ≥ 55°C. Thresholds configurable in Settings.
Running tests: Show an animated progress bar with percentage and elapsed time instead of a static badge.
Actions per drive: Short, Long, Burn-In buttons. Cancel button replaces Start when a test is running.
Row click: Opens the Log Drawer for that drive.
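The temperature colour bands can be expressed as two comparisons against the configurable thresholds. This is a sketch only; `temp_band` is a hypothetical helper, and the defaults mirror the warning and critical values documented for Settings.

```python
def temp_band(temp_c: float, warn_c: float = 46.0, crit_c: float = 55.0) -> str:
    """Map a drive temperature to the dashboard colour band."""
    if temp_c >= crit_c:
        return "red"
    if temp_c >= warn_c:
        return "yellow"
    return "green"
```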

Log Drawer

Slides up from the bottom of the page when a drive row is clicked. Does not navigate away — the table remains visible and scrollable above.

Three tabs:

Burn-In — stage-by-stage progress for the latest burn-in job; shows live elapsed time, raw SSH log output (smartctl / badblocks), and bad block count.
SMART — output of the last smartctl run for this drive, with monitored attribute values highlighted (green/yellow/red). Raw smartctl -a output also shown when SSH mode is active.
Events — chronological timeline of everything that happened to this drive (test started, test passed, failure detected, alert sent, reset, etc.).

Features:

Auto-scroll toggle (on by default).
Blinking cursor on the active output line of a running test.
Close button or click the same row again to dismiss.
Failed drives show error lines in red with exact bad block sector numbers.

History Page

Per-drive history. Each drive (identified by serial number) has a log of every burn-in run ever performed, with timestamps, results, and duration. Not per-session — per individual drive.

Audit Page

Application-level event log. Records: test started, test cancelled, settings changed, alert sent, container restarted, SSH connection lost/restored. Useful for debugging and for open source users troubleshooting their setup.

Stats Page

Aggregate statistics across all drives and all time. Pass rate, average test duration by drive size, failure breakdown by failure type.

Settings Page

Divided into sections:

EMAIL (SMTP)

Host, Mode (STARTTLS/SSL/plain), Port, Timeout, Username, Password, From, To.
Test Connection button.
Enable/disable toggle.

WEBHOOK

Single URL field. POST JSON payload on burnin_passed and burnin_failed events.
Compatible with ntfy.sh, Slack, Discord, n8n, and any generic HTTP POST receiver.
Leave blank to disable.

NOTIFICATIONS

Daily Report toggle (sends full drive status email at a configured hour).
Alert on Failure toggle (immediate — fires both email and webhook if both configured).
Alert on Pass toggle.

BURN-IN BEHAVIOR

Max Parallel Burn-Ins (default: 2, max: 60).
Warning displayed inline when set above 8: "Running many simultaneous surface scans may saturate your storage controller and produce unreliable results. Recommended: 2–4."
Bad block failure threshold (default: 0 — any bad sector = fail).
Stuck job threshold in hours (default: 24 — jobs running longer than this are auto-marked Unknown).
Adaptive thermal gate: When drive temperatures are at or above the warning threshold, new burn-in jobs wait up to 3 minutes before acquiring a semaphore slot. This reduces thermal pile-up when drives are already running hot.
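One way to picture the adaptive thermal gate is a bounded wait in front of the semaphore. The sketch below assumes an `asyncio.Semaphore` sized to Max Parallel Burn-Ins and a caller-supplied `hottest_drive_c` callback; both names, and the 5-second poll interval, are assumptions rather than the app's real internals.

```python
import asyncio
from typing import Callable


async def acquire_slot(
    slots: asyncio.Semaphore,
    hottest_drive_c: Callable[[], float],
    warn_c: float = 46.0,
    max_wait_s: float = 180.0,  # "up to 3 minutes"
    poll_s: float = 5.0,        # assumed re-check interval
) -> float:
    """Wait out thermal pressure (bounded), then take a burn-in slot.

    Returns the number of seconds spent waiting on temperature.
    """
    loop = asyncio.get_running_loop()
    start = loop.time()
    # Hold back while drives are at or above the warning threshold,
    # but never longer than max_wait_s.
    while hottest_drive_c() >= warn_c and loop.time() - start < max_wait_s:
        await asyncio.sleep(poll_s)
    waited = loop.time() - start
    await slots.acquire()
    return waited
```

The gate delays new jobs but does not block them indefinitely: once the wait expires, the job proceeds to the semaphore even if drives are still warm.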

TEMPERATURE

Warning threshold (default: 46°C).
Critical threshold (default: 55°C).

SSH

TrueNAS host/IP.
Port (default: 22).
Username.
Authentication: SSH key (paste or upload) or password.
Test Connection button.

SYSTEM (restart required to change — set in .env)

TrueNAS API URL.
Verify TLS toggle.
Poll interval (default: 12s).
Stale threshold (default: 45s).
IP allowlist.
Log level (DEBUG / INFO / WARN / ERROR).

VERSION & UPDATES

Displays current version.
"Check for Updates" button — queries Forgejo releases API at git.hellocomputer.xyz and shows latest version if an update is available.

Data Persistence

SQLite — single file, zero config, atomic writes. No data loss on container restart.

On restart, any drive that was in a running state is automatically transitioned to interrupted. The user sees "INTERRUPTED" in the burn-in column and must manually reset and re-queue the drive. The partial log up to the point of interruption is preserved and viewable in the Log Drawer.
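The restart behaviour amounts to a single state transition at startup. A minimal sketch with the standard `sqlite3` module follows; the `burnin_jobs` table and `state` column are hypothetical names chosen for illustration, not the project's actual schema.

```python
import sqlite3


def mark_interrupted(conn: sqlite3.Connection) -> int:
    """Flip every job left in 'running' to 'interrupted'; return how many changed."""
    # The connection context manager commits on success and rolls back on error.
    with conn:
        cursor = conn.execute(
            "UPDATE burnin_jobs SET state = 'interrupted' WHERE state = 'running'"
        )
    return cursor.rowcount
```

Running this once before the poller starts means the dashboard never shows a job as running when no process is tracking it.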

Drive location labels persist in SQLite tied to serial number, so a drive's label survives container restarts and reappears automatically when the drive is detected again.

Notifications

Email

Standard SMTP. Fires on: burn-in failure (immediate), burn-in pass (if enabled), daily report (scheduled).

Failure email includes: drive name, serial number, size, failure stage, failure reason, bad block count (if applicable), SMART attribute snapshot, timestamp.

Webhook

Single HTTP POST to configured URL with JSON body:

{
  "event": "burnin_failed",
  "drive": "sda",
  "serial": "WDZ1A002",
  "size": "12 TB",
  "failure_reason": "SURFACE VALIDATE",
  "bad_blocks": 2,
  "timestamp": "2025-01-15T03:21:04Z"
}

Compatible with ntfy.sh, Slack incoming webhooks, Discord webhooks, n8n HTTP trigger nodes.

Both email and webhook fire simultaneously when both are configured and enabled. User controls each independently via Settings toggles.
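A sketch of building and posting the webhook body with only the standard library is shown below. The field names follow the example payload above; the helper names are hypothetical, and omitting `failure_reason` and `bad_blocks` when they are not supplied (for example on `burnin_passed`) is an assumption.

```python
import json
import urllib.request
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def build_payload(
    event: str,
    drive: str,
    serial: str,
    size: str,
    failure_reason: Optional[str] = None,
    bad_blocks: Optional[int] = None,
    now: Optional[datetime] = None,
) -> Dict[str, Any]:
    """Assemble the JSON body; optional fields are left out when not provided."""
    payload: Dict[str, Any] = {"event": event, "drive": drive, "serial": serial, "size": size}
    if failure_reason is not None:
        payload["failure_reason"] = failure_reason
    if bad_blocks is not None:
        payload["bad_blocks"] = bad_blocks
    stamp = now or datetime.now(timezone.utc)
    payload["timestamp"] = stamp.strftime("%Y-%m-%dT%H:%M:%SZ")
    return payload


def build_request(url: str, payload: Dict[str, Any]) -> urllib.request.Request:
    """Wrap the payload in a JSON POST request."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending is then a matter of `urllib.request.urlopen(build_request(url, payload), timeout=10)`, ideally wrapped so that a failing receiver cannot stall the burn-in job.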

SSH Architecture

The app connects to TrueNAS over SSH from the host running the Docker container. It does not use the TrueNAS web API for SMART or badblocks operations — all commands are issued directly over SSH using asyncssh.

This is required for TrueNAS SCALE 25.10 (Goldeye), which removed the POST /api/v2.0/smart/test REST endpoint. SSH is also the only way to run badblocks. The TrueNAS REST API is still used for drive discovery (GET /api/v2.0/disk) and temperature polling (POST /api/v2.0/disk/temperatures).

Connection details are configured in Settings (not .env). Supports:

Password authentication.
SSH key authentication — key pasted into Settings UI or mounted as a Docker volume at /run/secrets/ssh_key (recommended for production).
Custom port (default: 22).
Test Connection button validates credentials before saving.

In addition to burn-in commands, the SSH connection is used to:

Run sensors -j (lm-sensors) each poll cycle to read CPU and PCH/chipset temperatures, displayed live in the dashboard stats bar.
Poll smartctl -a progress during standalone SMART tests.
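To illustrate the sensor read, the JSON printed by `sensors -j` can be reduced to one CPU and one PCH temperature. The sketch assumes the usual lm-sensors layout (chip name, then feature, then `tempN_input` keys) and chip names beginning with `coretemp` and `pch_`. Reporting the hottest reading per chip family is an assumption, and `read_temps` is a hypothetical name.

```python
import json
from typing import Dict, Optional


def read_temps(sensors_json: str) -> Dict[str, Optional[float]]:
    """Extract the hottest CPU and PCH readings from `sensors -j` output."""
    result: Dict[str, Optional[float]] = {"cpu_c": None, "pch_c": None}
    for chip, features in json.loads(sensors_json).items():
        if chip.startswith("coretemp"):
            key = "cpu_c"
        elif chip.startswith("pch_"):
            key = "pch_c"
        else:
            continue
        for feature in features.values():
            # Skip string entries such as "Adapter": "ISA adapter".
            if not isinstance(feature, dict):
                continue
            for name, value in feature.items():
                if name.startswith("temp") and name.endswith("_input"):
                    current = result[key]
                    result[key] = value if current is None else max(current, value)
    return result
```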

On SSH disconnection mid-test: the app marks the drive as interrupted. The remote process may or may not still be running. The user must reset the drive and re-queue.

API

A REST API is available at /api/v1/. It is documented via OpenAPI at /openapi.json and browsable at /api in the dashboard. Version displayed: 0.1.0 (API version tracked independently from app version).

Key endpoints:

GET /api/v1/drives — list all drives with current status.
GET /api/v1/drives/{drive_id} — single drive detail.
PATCH /api/v1/drives/{drive_id} — update drive metadata (e.g., location label).
POST /api/v1/drives/{drive_id}/smart/start — start a SMART test.
POST /api/v1/drives/{drive_id}/smart/cancel — cancel a SMART test.
POST /api/v1/burnin/start — start a burn-in job.
POST /api/v1/burnin/{job_id}/cancel — cancel a burn-in job.
GET /sse/drives — Server-Sent Events stream powering the real-time dashboard UI. Also emits system-sensors (CPU/PCH temps, thermal pressure) and job-alert (browser push notification) events.
GET /health — health check endpoint.
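For clients consuming `/sse/drives` outside a browser, the Server-Sent Events framing can be parsed in a few lines. This is a minimal sketch of the standard SSE line protocol (`event:` and `data:` fields, blank line as terminator); `iter_sse` is a hypothetical helper, and the JSON shape inside each `data` field is not specified here.

```python
from typing import Dict, Iterable, Iterator, List


def iter_sse(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield {"event": ..., "data": ...} for each complete SSE message."""
    event = "message"  # default event type when no "event:" field is sent
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current message.
            if data:
                yield {"event": event, "data": "\n".join(data)}
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)
```

A consumer would typically switch on the `event` value, for example `system-sensors` or `job-alert`, and decode `data` as JSON.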

The API makes this app a strong candidate for MCP server integration, allowing an AI assistant to query drive status, start tests, or receive alerts conversationally.

Deployment

Docker Compose. Minimum viable setup:

git clone https://github.com/yourusername/nas-burnin
cd nas-burnin
cp .env.example .env
# Edit .env for system-level settings (TrueNAS URL, poll interval, etc.)
docker compose up -d

Navigate to http://your-vm-ip:port and complete SSH and SMTP configuration in Settings.

All other configuration is done through the Settings UI — no manual file editing required beyond .env for system-level values.

TrueNAS Compatibility

Tested and confirmed working against TrueNAS SCALE 25.10.2.1 (Goldeye). Key compatibility notes:

SCALE 25.10 removed POST /api/v2.0/smart/test — SSH is required for all SMART operations.
Drive temperatures are not included in GET /api/v2.0/disk on SCALE — use POST /api/v2.0/disk/temperatures instead.
TrueNAS SCALE is Linux/Debian-based. Device names are sda, sdb, etc. (not ada0/da0 as on CORE/FreeBSD).
lm-sensors is available on SCALE — sensors -j returns CPU (coretemp) and PCH (pch_*) temperatures.
badblocks and smartctl are present at standard paths.

mock-truenas

A companion Docker service (mock-truenas) that simulates the TrueNAS API for UI development and testing without real hardware. It mocks drive discovery, SMART test responses, and badblocks progress. Used exclusively for development — not deployed in production. Disabled (commented out) in the production docker-compose.yml.

Version

App version: 1.0.0-39 (displayed in header next to the title, and in Settings).
Update check in Settings queries Forgejo releases API (git.hellocomputer.xyz).
API version tracked separately, currently 0.1.0.

Out of Scope (v1.0)

Scheduled or automated burn-in triggering.
Non-destructive badblocks mode (read-only surface scan).
Multi-TrueNAS support (single host only).
User authentication / login wall (single-user, self-hosted, IP allowlist is sufficient).
Mobile-optimized UI (desktop dashboard only).
