Jobs 60-63 ran healthy for 16h then all 4 died simultaneously with
'database is locked'. The burnin drain used _db() which set
busy_timeout=10000, but:
1. 10s was sometimes too short under heavy contention (4 burn-in
drains writing every 5s + poller every 12s + retention scan +
auth + lifespan = many concurrent writers).
2. OTHER aiosqlite.connect() sites (poller, retention, auth, mailer,
routes/__init__'s SSE, burnin/__init__.py's various helpers,
database.get_db) didn't set busy_timeout at all. Without it,
SQLite raises 'database is locked' INSTANTLY on any contention,
which forced concurrency back onto the drain's connection.
Fix:
- _db() busy_timeout 10000 → 60000 (60s; aggressive but right for
this workload — brief contention spikes are normal and waiting
beats failing).
- PRAGMA busy_timeout=60000 added on every aiosqlite.connect() site
next to the existing PRAGMA calls. Applied via a small Python
pass that preserves the original variable name (db / _tdb / src
/ dst etc.) and indentation.
Same restart sequence applied: rebuild container, reset 4 drives,
relaunch via loopback bypass. Jobs 64-67 are now running.
This is auto-restart #2 in 24h. Safety brake at 3.