diff --git a/README.md b/README.md
index 5335f17..4c0da7b 100644
--- a/README.md
+++ b/README.md
@@ -83,11 +83,12 @@ runtime roughly in half at ~2× RAM cost — matches the upstream
 
 ### Watch out
 
-- **Stuck-job timeout** — `stuck_job_hours` (default 24) marks any job
-  past that threshold as `unknown` and kills the remote process. If
-  you're burning in 14 TB drives with default block size, raise this to
-  **48** in Settings before starting, or you'll get false positives near
-  the end of surface_validate.
+- **Stuck-job timeout** — `stuck_job_hours` (default 168 = 7 days)
+  marks any job past that threshold as `unknown` and kills the remote
+  process. The default covers `-w` surface_validate on 14 TB+ HDDs with
+  margin. If you're running short SSDs and want faster detection of
+  genuinely stuck jobs, drop it. (Earlier versions defaulted to 24h
+  which false-positived on multi-TB drives.)
 - **Thermal gate** — if drives currently under burn-in hit the
   temperature warning threshold, new jobs wait up to 3 minutes before
   acquiring a slot. Increase `temp_warn_c` if your chassis runs hot but
@@ -144,7 +145,8 @@ All settings live under `/settings` (header link). Key knobs:
 - **`surface_validate_block_size` / `_block_buffer` / `_passes`** —
   badblocks `-b` / `-c` / `-p`. Defaults preserve original behaviour;
   tune for speed vs paranoia.
-- **`stuck_job_hours`** (default 24) — raise for big drives.
+- **`stuck_job_hours`** (default 168 = 7 days) — covers 14 TB+ HDDs;
+  drop for faster detection on small fast drives.
 - **`temp_warn_c` / `temp_crit_c`** — thermal gating thresholds.
 - **`bad_block_threshold`** (default 0) — number of bad blocks
   surface_validate tolerates before failing the stage.
diff --git a/app/config.py b/app/config.py
index eaa43c9..ef97840 100644
--- a/app/config.py
+++ b/app/config.py
@@ -49,7 +49,10 @@ class Settings(BaseSettings):
     webhook_url: str = ""
 
     # Stuck-job detection: jobs running longer than this are marked 'unknown'
-    stuck_job_hours: int = 24
+    # and the remote badblocks/smartctl is killed. 168h (7 days) covers a
+    # full -w surface_validate on a 14 TB+ HDD with margin. Older default
+    # was 24h which false-positived on multi-TB drives almost every time.
+    stuck_job_hours: int = 168
 
     # Temperature thresholds (°C) — drives table colouring + precheck gate
     temp_warn_c: int = 46  # orange warning
@@ -83,7 +86,7 @@ class Settings(BaseSettings):
     ssh_key: str = ""  # PEM private key content (paste full key including headers)
 
     # Application version — used by the /api/v1/updates/check endpoint
-    app_version: str = "1.0.0-42"
+    app_version: str = "1.0.0-43"
 
     # ---- Authentication (1.0.0-22) ----
     # session_secret: HMAC key for signing session cookies. Empty = generate
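
Review note: a rough sanity check of the new default, written as a minimal sketch rather than the project's actual detection code. The helper names `estimate_write_test_hours` and `is_stuck` are hypothetical, and the 200 MB/s average throughput in the usage note is an assumption, not a measured figure. The only fact relied on is that `badblocks -w` writes four test patterns and reads each one back, which gives roughly eight sequential passes over the drive.

```python
from datetime import datetime, timedelta, timezone

# badblocks -w writes four patterns and verifies each one by reading it back,
# so a full destructive run is roughly 8 sequential passes over the drive.
PASSES_PER_WRITE_TEST = 8


def estimate_write_test_hours(capacity_tb: float, throughput_mb_s: float) -> float:
    """Rough duration in hours of a -w run (hypothetical helper).

    Assumes a constant average throughput and decimal units
    (1 TB = 1e12 bytes, 1 MB = 1e6 bytes).
    """
    capacity_bytes = capacity_tb * 1e12
    seconds_per_pass = capacity_bytes / (throughput_mb_s * 1e6)
    return PASSES_PER_WRITE_TEST * seconds_per_pass / 3600


def is_stuck(started_at: datetime, now: datetime, stuck_job_hours: int = 168) -> bool:
    """True once a job has run longer than the configured threshold.

    Hypothetical check; the real application may compare timestamps differently.
    """
    return now - started_at > timedelta(hours=stuck_job_hours)


if __name__ == "__main__":
    hours = estimate_write_test_hours(capacity_tb=14, throughput_mb_s=200)
    print(f"estimated -w run on 14 TB: {hours:.1f} h")

    start = datetime(2024, 1, 1, tzinfo=timezone.utc)
    late = start + timedelta(hours=hours)
    print("stuck at 24 h threshold:", is_stuck(start, late, stuck_job_hours=24))
    print("stuck at 168 h threshold:", is_stuck(start, late, stuck_job_hours=168))
```

Under those assumptions a 14 TB drive needs well over 24 hours but stays under 168 hours, which matches the rationale in the diff. A slower drive or a non-default `_passes` setting would shrink that margin, so the estimate is worth rerunning with real throughput numbers.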