fix: stuck_job_hours default 24 → 168 (7 days) (1.0.0-43)

A user with 4× 14 TB WD HDDs running -w surface_validate had all
4 jobs marked 'unknown' at exactly 24h+1min — the stuck-job
detector firing on legitimate work because 14 TB at 8192-block
badblocks needs ~5+ days to complete all 4 patterns × 2 phases.
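
The runtime figure above can be sanity-checked with rough arithmetic. This is a minimal sketch, not the project's code: the function name `estimate_badblocks_hours` is hypothetical, and the 250 MB/s sustained throughput is an assumption (real drives slow down toward the inner tracks). It treats each of the 4 patterns as one full write pass plus one full read pass.

```python
def estimate_badblocks_hours(capacity_tb, throughput_mb_s, patterns=4, phases=2):
    """Rough wall-clock estimate for a destructive badblocks -w run.

    Assumes every pattern needs `phases` full passes over the drive
    (write, then read back) at a constant sustained throughput.
    """
    capacity_bytes = capacity_tb * 1e12            # decimal TB, as drives are sold
    seconds_per_pass = capacity_bytes / (throughput_mb_s * 1e6)
    return patterns * phases * seconds_per_pass / 3600


hours = estimate_badblocks_hours(14, 250)          # 250 MB/s is an assumed average
print(f"{hours:.1f} h = {hours / 24:.1f} days")    # 124.4 h = 5.2 days
print("exceeds 24h default:", hours > 24)
print("fits in 168h default:", hours < 168)
```

Under these assumptions the job runs far past the old 24h threshold and still finishes inside 168h; a slower average throughput eats into that margin.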

168h covers a full -w pass on 14 TB+ HDDs with margin. Anyone
running short SSDs who wants faster detection can drop the value
in Settings → Burn-in.

README warning replaced — no longer instructs users to bump the
threshold before starting big-drive burn-ins, since the default
now handles that case.

Settings UI already accepts up to 168 via the input's max=168
attribute, so no template change needed.
Brandon Walter 2026-05-08 13:23:05 -07:00
parent b406e3f315
commit 4922b19a9f
2 changed files with 13 additions and 8 deletions


@@ -83,11 +83,12 @@ runtime roughly in half at ~2× RAM cost — matches the upstream
 ### Watch out
-- **Stuck-job timeout**`stuck_job_hours` (default 24) marks any job
-  past that threshold as `unknown` and kills the remote process. If
-  you're burning in 14 TB drives with default block size, raise this to
-  **48** in Settings before starting, or you'll get false positives near
-  the end of surface_validate.
+- **Stuck-job timeout**`stuck_job_hours` (default 168 = 7 days)
+  marks any job past that threshold as `unknown` and kills the remote
+  process. The default covers `-w` surface_validate on 14 TB+ HDDs with
+  margin. If you're running short SSDs and want faster detection of
+  genuinely stuck jobs, drop it. (Earlier versions defaulted to 24h
+  which false-positived on multi-TB drives.)
 - **Thermal gate** — if drives currently under burn-in hit the
   temperature warning threshold, new jobs wait up to 3 minutes before
   acquiring a slot. Increase `temp_warn_c` if your chassis runs hot but
@@ -144,7 +145,8 @@ All settings live under `/settings` (header link). Key knobs:
 - **`surface_validate_block_size` / `_block_buffer` / `_passes`** —
   badblocks `-b` / `-c` / `-p`. Defaults preserve original behaviour;
   tune for speed vs paranoia.
-- **`stuck_job_hours`** (default 24) — raise for big drives.
+- **`stuck_job_hours`** (default 168 = 7 days) — covers 14 TB+ HDDs;
+  drop for faster detection on small fast drives.
 - **`temp_warn_c` / `temp_crit_c`** — thermal gating thresholds.
 - **`bad_block_threshold`** (default 0) — number of bad blocks
   surface_validate tolerates before failing the stage.


@@ -49,7 +49,10 @@ class Settings(BaseSettings):
     webhook_url: str = ""
     # Stuck-job detection: jobs running longer than this are marked 'unknown'
-    stuck_job_hours: int = 24
+    # and the remote badblocks/smartctl is killed. 168h (7 days) covers a
+    # full -w surface_validate on a 14 TB+ HDD with margin. Older default
+    # was 24h which false-positived on multi-TB drives almost every time.
+    stuck_job_hours: int = 168
     # Temperature thresholds (°C) — drives table colouring + precheck gate
     temp_warn_c: int = 46 # orange warning
@@ -83,7 +86,7 @@ class Settings(BaseSettings):
     ssh_key: str = "" # PEM private key content (paste full key including headers)
     # Application version — used by the /api/v1/updates/check endpoint
-    app_version: str = "1.0.0-42"
+    app_version: str = "1.0.0-43"
     # ---- Authentication (1.0.0-22) ----
     # session_secret: HMAC key for signing session cookies. Empty = generate
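
To illustrate why the reported jobs flipped to 'unknown' at 24h+1min, here is a minimal sketch of a threshold check of this kind. The helper `is_stuck` and its comparison are assumptions for illustration only; the project's actual detector is not shown in this diff and may differ.

```python
from datetime import datetime, timedelta


def is_stuck(started_at, now, stuck_job_hours):
    """Hypothetical check: a job counts as stuck once its runtime exceeds the threshold."""
    return now - started_at > timedelta(hours=stuck_job_hours)


started = datetime(2026, 5, 1, 12, 0)
now = started + timedelta(hours=24, minutes=1)   # the 24h+1min case from the report

print(is_stuck(started, now, 24))    # True: old default flags a healthy long job
print(is_stuck(started, now, 168))   # False: new default lets it keep running
```

A purely time-based check like this cannot tell slow legitimate work from a hung process, which is why the threshold has to sit above the longest expected run.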