nas-burnin/app/mailer.py
Brandon Walter 5da1a1704f feat: pool-membership lock + cancellation hardening + smart_health refresh + tunables (1.0.0-13 -> 1.0.0-21)
Substantial feature + reliability sweep. Each version below was developed,
tested live against the maple/TrueNAS deployment, and Codex-reviewed
before bundling.

1.0.0-13 — asyncssh proc.kill() doesn't actually kill the remote process
  (sshd ignores SSH signal-channel requests by default), so a cancel of a
  long-running badblocks left the remote process running and proc.wait()
  hanging — pinning the asyncio.Semaphore slot forever.

  * Wrap long-lived commands in `sh -c 'echo PID:$$; exec <cmd>'` to
    capture the remote PID; store in burnin._remote_pids[job_id].
  * burnin._kill_remote_process(job_id) opens a fresh SSH session and
    issues `kill -9 <pid>` — sshd honours that.
  * Bound proc.wait() with asyncio.wait_for(timeout=15).
  * burnin._active_tasks tracks every _run_job task so cancel_job and
    check_stuck_jobs can actually cancel the asyncio task (was DB-only
    before). Also fixes the documented asyncio.create_task GC gotcha
    (weak refs only).
  * _run_job finalizer reads current state and skips the write if state
    != 'running' so cancelled/unknown aren't clobbered.

1.0.0-14 — poller._upsert_drive ON CONFLICT only refreshed temperature/
  health/poll timestamps; devname/serial/model/size_bytes were stuck at
  first-INSERT values forever. After kernel SCSI re-enumeration two
  drives could both show as `sda`. Fixed by updating all six fields.
  Also added 7-day stale filter to _DRIVES_QUERY so removed drives drop
  off the dashboard while audit/burnin_jobs FKs stay intact.

1.0.0-15/-16 — pool-membership lock.
  * ssh_client.get_pool_membership() runs `zpool list -vHP` and parses
    the flattened TrueNAS output (container vdevs + their device children
    both appear at depth 1; section markers cache/log/spare/special/dedup
    switch the role).
  * ssh_client.get_zfs_member_drives() runs `lsblk -no NAME,FSTYPE -l`
    to detect drives carrying ZFS labels not in any active pool — they
    get pool_name='(exported)', pool_role='exported'.
  * Three idempotent ALTER TABLE migrations on drives:
    pool_name/pool_role/pool_seen_at.
  * burnin.start_job raises PoolMemberError if pool_name IS NOT NULL and
    the drive isn't in burnin._unlock_grants. Routes layer maps to 409
    with structured detail {pool_name, pool_role, pool_locked: true} so
    the frontend can render an unlock affordance.
  * POST /api/v1/drives/{id}/unlock accepts {confirm_token, operator,
    reason}. Token is the pool name for active pools, "DESTROY BOOT POOL"
    for boot-pool, "DESTROY EXPORTED POOL" for exported. Reason >= 5
    chars. TTL = UNLOCK_TTL_SECONDS = 600. Audit event types:
    pool_drive_unlocked / boot_pool_drive_unlocked /
    exported_pool_drive_unlocked.
  * Grants are in-memory only — container restart wipes them.
  * UI: lock icon (yellow/red/orange), pool pill, conditional Unlock vs
    Burn-In button. modal_unlock.html with type-to-confirm field.
    Live unlock countdown via tickUnlockCountdowns() in app.js.
  * Daily report: red banner listing every unlock event from the last
    24h, with operator + reason + timestamp.

1.0.0-17 — Codex review fail-open + XSS + structured-error fixes.
  * ssh_client.get_pool_membership / get_zfs_member_drives now return
    None on failure (vs {} for 'definitely empty'). poller passes
    update_pool=False to _upsert_drive on detection failure, preserving
    existing pool columns instead of clearing them. Without this fix a
    1-second SSH blip silently unlocked every drive.
  * mailer._build_unlock_banner_html escapes every interpolated field
    via html.escape() (was '<' only). Time filter switched to
    julianday() — string >= against datetime('now', '-1 day') compared
    formats with different separators ('T' vs ' ') and timezone
    suffixes, causing subtle off-by-N-hour inclusion.
  * app.js submitStart/submitBatchStart now detect the structured
    pool_locked 409 detail and auto-open the unlock modal for the
    offending drive (was [object Object] in toast).

1.0.0-18 — Codex grant-binding + commit-ordering fixes.
  * Unlock grants bound to the (pool_name, pool_role) observed at unlock
    time. _UnlockGrant dataclass; _is_unlocked and unlock_expiry
    invalidate the grant if the live row's pool identity has changed.
    Prevents an 'exported' unlock from carrying over when the drive
    turns out to be in active 'tank' or 'boot-pool'.
  * grant_pool_unlock now writes to _unlock_grants only AFTER db.commit()
    succeeds — previously a failed audit insert left an unaudited grant
    armed.

1.0.0-19 — Codex race + cancellation classification + test scaffold.
  * Partial unique index uniq_active_burnin_per_drive ON burnin_jobs
    (drive_id) WHERE state IN ('queued','running'). INSERT now wraps in
    try/except aiosqlite.IntegrityError -> ValueError so the read-then-
    insert race in start_job can't produce two queued rows for the same
    drive.
  * _run_job tracks was_cancelled flag; on bare task.cancel() (shutdown,
    future code paths) where DB state is still 'running', finalizer
    writes 'unknown' instead of mis-classifying as 'failed'.
  * tests/ stdlib unittest scaffold:
    - test_pool_parser.py (21 tests): mirror/raidz/draid container vdevs,
      single-disk depth-1, plural section markers, partition stripping,
      sdaa-style names, multi-pool, role reset between pools.
    - test_unlock_flow.py (18 tests): token validation per pool kind,
      identity-binding invalidation, TTL expiry, audit-commit-then-arm
      ordering, unique-active-burnin partial index.
    Run via `python -m unittest discover tests/`. No new dependencies.

1.0.0-20 — Spearfoot-inspired badblocks tunables.
  * surface_validate_block_size (-b, default 4096), surface_validate_
    block_buffer (-c, default 64), surface_validate_passes (-p, default
    1) exposed in Settings UI; persist via settings_store.json.
    Validation: block size must be a power of 2 between 512 and
    1048576. Defaults preserve existing behaviour. Bumping to 8192/64/1
    roughly halves runtime on multi-TB HDDs at ~2x RAM cost.

1.0.0-21 — SMART overall-health column actually populated.
  * /api/v2.0/disk doesn't expose smart_health, so every drive defaulted
    to UNKNOWN forever (only burn-in stages ever wrote a real value).
  * ssh_client.get_smart_health_map([devnames]) runs `smartctl -H` for
    all drives in a single SSH session, deterministically delimited with
    @@devname@@ ... @@END@@ markers. Returns {devname: PASSED|FAILED|
    UNKNOWN} or None on SSH failure.
  * poller calls it every 5th cycle (~1 min at default 12s interval),
    caches in _state['smart_health_cache'] so transient failures preserve
    the previous values.
  * Dashboard CSS: col-smart min-width 150 -> 95, horizontal padding 14
    -> 6 so Short/Long SMART columns fit comfortably on a 13-inch
    display.
  * 5 additional parser tests (44 total, all passing).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 09:25:56 -04:00

535 lines
22 KiB
Python

"""
Daily status email — sent at smtp_report_hour (local time) every day.
Disabled when SMTP_HOST is not set.
"""
import asyncio
import html
import logging
import smtplib
import ssl
from datetime import datetime, timedelta, timezone
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
import aiosqlite
from app.config import settings
log = logging.getLogger(__name__)
# ---------------------------------------------------------------------------
# HTML email template
# ---------------------------------------------------------------------------
def _chip(state: str) -> str:
colours = {
"PASSED": ("#1a4731", "#3fb950", "#3fb950"),
"passed": ("#1a4731", "#3fb950", "#3fb950"),
"FAILED": ("#4b1113", "#f85149", "#f85149"),
"failed": ("#4b1113", "#f85149", "#f85149"),
"running": ("#0d2d6b", "#58a6ff", "#58a6ff"),
"queued": ("#4b3800", "#d29922", "#d29922"),
"cancelled": ("#222", "#8b949e", "#8b949e"),
"unknown": ("#222", "#8b949e", "#8b949e"),
"idle": ("#222", "#8b949e", "#8b949e"),
"UNKNOWN": ("#222", "#8b949e", "#8b949e"),
}
bg, fg, bd = colours.get(state, ("#222", "#8b949e", "#8b949e"))
label = state.upper()
return (
f'<span style="background:{bg};color:{fg};border:1px solid {bd};'
f'border-radius:4px;padding:2px 8px;font-size:11px;font-weight:600;'
f'letter-spacing:.04em;white-space:nowrap">{label}</span>'
)
def _temp_colour(c) -> str:
if c is None:
return "#8b949e"
if c < 40:
return "#3fb950"
if c < 50:
return "#d29922"
return "#f85149"
def _fmt_bytes(b) -> str:
if b is None:
return ""
tb = b / 1_000_000_000_000
if tb >= 1:
return f"{tb:.0f} TB"
return f"{b / 1_000_000_000:.0f} GB"
def _fmt_dt(iso: str | None) -> str:
if not iso:
return ""
try:
dt = datetime.fromisoformat(iso)
if dt.tzinfo is None:
dt = dt.replace(tzinfo=timezone.utc)
return dt.astimezone().strftime("%Y-%m-%d %H:%M")
except Exception:
return iso or ""
def _drive_rows_html(drives: list[dict]) -> str:
if not drives:
return '<tr><td colspan="8" style="text-align:center;color:#8b949e;padding:24px">No drives found</td></tr>'
rows = []
for d in drives:
health = d.get("smart_health") or "UNKNOWN"
temp = d.get("temperature_c")
bi = d.get("burnin") or {}
bi_state = bi.get("state", "") if bi else ""
short = d.get("smart_short") or {}
long_ = d.get("smart_long") or {}
short_state = short.get("state", "idle")
long_state = long_.get("state", "idle")
row_bg = "#1c0a0a" if health == "FAILED" else "#0d1117"
rows.append(f"""
<tr style="background:{row_bg};border-bottom:1px solid #30363d">
<td style="padding:9px 12px;font-weight:600;color:#c9d1d9">{d.get('devname','')}</td>
<td style="padding:9px 12px;color:#8b949e;font-size:12px">{d.get('model','')}</td>
<td style="padding:9px 12px;font-family:monospace;font-size:12px;color:#8b949e">{d.get('serial','')}</td>
<td style="padding:9px 12px;text-align:right;color:#8b949e">{_fmt_bytes(d.get('size_bytes'))}</td>
<td style="padding:9px 12px;text-align:right;color:{_temp_colour(temp)};font-weight:500">{f'{temp}°C' if temp is not None else ''}</td>
<td style="padding:9px 12px">{_chip(health)}</td>
<td style="padding:9px 12px">{_chip(short_state)}</td>
<td style="padding:9px 12px">{_chip(long_state)}</td>
<td style="padding:9px 12px">{_chip(bi_state) if bi else ''}</td>
</tr>""")
return "\n".join(rows)
def _build_unlock_banner_html(events: list[dict]) -> str:
"""Banner listing every pool-drive unlock granted in the last 24h.
Every interpolated DB field is run through html.escape — operator and
reason are free-text from the unlock modal and otherwise inject into
the email body verbatim.
"""
if not events:
return ""
rows = []
for e in events:
evt = e.get("event_type") or ""
is_boot = evt == "boot_pool_drive_unlocked"
is_exported = evt == "exported_pool_drive_unlocked"
kind = (
"BOOT POOL" if is_boot
else "EXPORTED ZFS" if is_exported
else "pool"
)
when = html.escape((e.get("created_at") or "")[:19])
operator = html.escape(e.get("operator") or "?")
devname = html.escape(e.get("devname") or "?")
# `message` already includes pool name, devname, and the operator's
# reason — surface it verbatim so the audit trail is faithful.
message = html.escape(e.get("message") or "")
rows.append(
f"<li style='margin:4px 0'><strong>{when}</strong> &middot; "
f"<strong>{operator}</strong> unlocked a {kind} drive "
f"({devname}): "
f"<span style='color:#c9d1d9'>{message}</span></li>"
)
return f"""
<div style="background:#4b1113;border:1px solid #f85149;border-radius:6px;
padding:14px 18px;margin-bottom:20px;color:#f85149">
<div style="font-weight:600;font-size:14px;margin-bottom:6px">
&#x26A0; {len(events)} pool-drive unlock(s) in the last 24h
</div>
<ul style="margin:0;padding-left:18px;font-size:12.5px;color:#f0a0a0">
{''.join(rows)}
</ul>
</div>"""
def _build_html(drives: list[dict], generated_at: str,
unlock_events: list[dict] | None = None) -> str:
total = len(drives)
failed_drives = [d for d in drives if d.get("smart_health") == "FAILED"]
running_burnin = [d for d in drives if (d.get("burnin") or {}).get("state") == "running"]
passed_burnin = [d for d in drives if (d.get("burnin") or {}).get("state") == "passed"]
# Alert banners (unlock events first — the audit-grade signal)
alert_html = _build_unlock_banner_html(unlock_events or [])
if failed_drives:
names = ", ".join(d["devname"] for d in failed_drives)
alert_html += f"""
<div style="background:#4b1113;border:1px solid #f85149;border-radius:6px;padding:14px 18px;margin-bottom:20px;color:#f85149;font-weight:500">
⚠ SMART health FAILED on {len(failed_drives)} drive(s): {names}
</div>"""
drive_rows = _drive_rows_html(drives)
return f"""<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width,initial-scale=1">
<title>TrueNAS Burn-In — Daily Report</title>
</head>
<body style="margin:0;padding:0;background:#0d1117;font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',system-ui,sans-serif;font-size:14px;color:#c9d1d9">
<table width="100%" cellpadding="0" cellspacing="0" style="background:#0d1117;min-height:100vh">
<tr><td align="center" style="padding:32px 16px">
<table width="700" cellpadding="0" cellspacing="0" style="max-width:700px;width:100%">
<!-- Header -->
<tr>
<td style="background:#161b22;border:1px solid #30363d;border-radius:10px 10px 0 0;padding:20px 24px;border-bottom:none">
<table width="100%" cellpadding="0" cellspacing="0">
<tr>
<td><span style="font-size:18px;font-weight:700;color:#f0f6fc">TrueNAS Burn-In</span>
<span style="color:#8b949e;font-size:13px;margin-left:10px">Daily Status Report</span></td>
<td align="right" style="color:#8b949e;font-size:12px">{generated_at}</td>
</tr>
</table>
</td>
</tr>
<!-- Body -->
<tr>
<td style="background:#0d1117;border:1px solid #30363d;border-top:none;border-bottom:none;padding:24px">
{alert_html}
<!-- Summary chips -->
<table cellpadding="0" cellspacing="0" style="margin-bottom:24px">
<tr>
<td style="padding-right:10px">
<div style="background:#161b22;border:1px solid #30363d;border-radius:8px;padding:12px 18px;text-align:center;min-width:80px">
<div style="font-size:24px;font-weight:700;color:#f0f6fc">{total}</div>
<div style="font-size:11px;color:#8b949e;text-transform:uppercase;letter-spacing:.06em;margin-top:2px">Drives</div>
</div>
</td>
<td style="padding-right:10px">
<div style="background:#161b22;border:1px solid #30363d;border-radius:8px;padding:12px 18px;text-align:center;min-width:80px">
<div style="font-size:24px;font-weight:700;color:#f85149">{len(failed_drives)}</div>
<div style="font-size:11px;color:#8b949e;text-transform:uppercase;letter-spacing:.06em;margin-top:2px">Failed</div>
</div>
</td>
<td style="padding-right:10px">
<div style="background:#161b22;border:1px solid #30363d;border-radius:8px;padding:12px 18px;text-align:center;min-width:80px">
<div style="font-size:24px;font-weight:700;color:#58a6ff">{len(running_burnin)}</div>
<div style="font-size:11px;color:#8b949e;text-transform:uppercase;letter-spacing:.06em;margin-top:2px">Running</div>
</div>
</td>
<td>
<div style="background:#161b22;border:1px solid #30363d;border-radius:8px;padding:12px 18px;text-align:center;min-width:80px">
<div style="font-size:24px;font-weight:700;color:#3fb950">{len(passed_burnin)}</div>
<div style="font-size:11px;color:#8b949e;text-transform:uppercase;letter-spacing:.06em;margin-top:2px">Passed</div>
</div>
</td>
</tr>
</table>
<!-- Drive table -->
<table width="100%" cellpadding="0" cellspacing="0" style="border:1px solid #30363d;border-radius:8px;overflow:hidden">
<thead>
<tr style="background:#161b22">
<th style="padding:9px 12px;font-size:11px;font-weight:600;text-transform:uppercase;letter-spacing:.06em;color:#8b949e;text-align:left;border-bottom:1px solid #30363d">Drive</th>
<th style="padding:9px 12px;font-size:11px;font-weight:600;text-transform:uppercase;letter-spacing:.06em;color:#8b949e;text-align:left;border-bottom:1px solid #30363d">Model</th>
<th style="padding:9px 12px;font-size:11px;font-weight:600;text-transform:uppercase;letter-spacing:.06em;color:#8b949e;text-align:left;border-bottom:1px solid #30363d">Serial</th>
<th style="padding:9px 12px;font-size:11px;font-weight:600;text-transform:uppercase;letter-spacing:.06em;color:#8b949e;text-align:right;border-bottom:1px solid #30363d">Size</th>
<th style="padding:9px 12px;font-size:11px;font-weight:600;text-transform:uppercase;letter-spacing:.06em;color:#8b949e;text-align:right;border-bottom:1px solid #30363d">Temp</th>
<th style="padding:9px 12px;font-size:11px;font-weight:600;text-transform:uppercase;letter-spacing:.06em;color:#8b949e;text-align:left;border-bottom:1px solid #30363d">Health</th>
<th style="padding:9px 12px;font-size:11px;font-weight:600;text-transform:uppercase;letter-spacing:.06em;color:#8b949e;text-align:left;border-bottom:1px solid #30363d">Short</th>
<th style="padding:9px 12px;font-size:11px;font-weight:600;text-transform:uppercase;letter-spacing:.06em;color:#8b949e;text-align:left;border-bottom:1px solid #30363d">Long</th>
<th style="padding:9px 12px;font-size:11px;font-weight:600;text-transform:uppercase;letter-spacing:.06em;color:#8b949e;text-align:left;border-bottom:1px solid #30363d">Burn-In</th>
</tr>
</thead>
<tbody>
{drive_rows}
</tbody>
</table>
</td>
</tr>
<!-- Footer -->
<tr>
<td style="background:#161b22;border:1px solid #30363d;border-top:none;border-radius:0 0 10px 10px;padding:14px 24px;text-align:center">
<span style="font-size:12px;color:#8b949e">Generated by TrueNAS Burn-In Dashboard · {generated_at}</span>
</td>
</tr>
</table>
</td></tr>
</table>
</body>
</html>"""
# ---------------------------------------------------------------------------
# Send
# ---------------------------------------------------------------------------
# Standard ports for each SSL mode — used when smtp_port is not overridden
_MODE_PORTS: dict[str, int] = {"starttls": 587, "ssl": 465, "plain": 25}
def _smtp_port() -> int:
"""Derive port from ssl_mode; fall back to settings.smtp_port if explicitly set."""
mode = (settings.smtp_ssl_mode or "starttls").lower()
return _MODE_PORTS.get(mode, 587)
def _send_email(subject: str, html: str) -> None:
recipients = [r.strip() for r in settings.smtp_to.split(",") if r.strip()]
if not recipients:
log.warning("SMTP_TO is empty — skipping send")
return
msg = MIMEMultipart("alternative")
msg["Subject"] = subject
msg["From"] = settings.smtp_from or settings.smtp_user
msg["To"] = ", ".join(recipients)
msg.attach(MIMEText(html, "html", "utf-8"))
ctx = ssl.create_default_context()
mode = (settings.smtp_ssl_mode or "starttls").lower()
timeout = int(settings.smtp_timeout or 60)
port = _smtp_port()
if mode == "ssl":
server = smtplib.SMTP_SSL(settings.smtp_host, port, context=ctx, timeout=timeout)
server.ehlo()
server.login(settings.smtp_user, settings.smtp_password)
server.sendmail(msg["From"], recipients, msg.as_string())
server.quit()
else:
with smtplib.SMTP(settings.smtp_host, port, timeout=timeout) as server:
server.ehlo()
if mode == "starttls":
server.starttls(context=ctx)
server.ehlo()
server.login(settings.smtp_user, settings.smtp_password)
server.sendmail(msg["From"], recipients, msg.as_string())
log.info("Email sent to %s", recipients)
# ---------------------------------------------------------------------------
# Data fetch
# ---------------------------------------------------------------------------
async def _fetch_report_data() -> list[dict]:
"""Pull drives + latest burnin state from DB."""
from app.routes import _fetch_drives_for_template # local import avoids circular
async with aiosqlite.connect(settings.db_path) as db:
db.row_factory = aiosqlite.Row
await db.execute("PRAGMA journal_mode=WAL")
return await _fetch_drives_for_template(db)
async def _fetch_unlock_events_24h() -> list[dict]:
"""Return pool-drive unlock audit events from the last 24 hours.
These are operator overrides of the pool-membership lock — every entry
represents a deliberate decision to risk a pool, so the daily report
surfaces them as an audit-grade banner.
"""
async with aiosqlite.connect(settings.db_path) as db:
db.row_factory = aiosqlite.Row
await db.execute("PRAGMA journal_mode=WAL")
# julianday() handles the 'YYYY-MM-DDTHH:MM:SS.fff+00:00' format
# we write from Python; comparing the raw string against
# datetime('now','-1 day') (which formats as 'YYYY-MM-DD HH:MM:SS')
# produces subtle off-by-up-to-a-day errors because of the
# 'T' vs ' ' separator and the '+00:00' suffix.
cur = await db.execute("""
SELECT ae.event_type, ae.operator, ae.message, ae.created_at,
d.devname, d.pool_name, d.pool_role
FROM audit_events ae
LEFT JOIN drives d ON d.id = ae.drive_id
WHERE ae.event_type IN (
'pool_drive_unlocked',
'boot_pool_drive_unlocked',
'exported_pool_drive_unlocked')
AND julianday(ae.created_at) >= julianday('now', '-1 day')
ORDER BY ae.created_at DESC
""")
return [dict(r) for r in await cur.fetchall()]
# ---------------------------------------------------------------------------
# Scheduler
# ---------------------------------------------------------------------------
def _build_alert_html(
job_id: int,
devname: str,
serial: str | None,
model: str | None,
state: str,
error_text: str | None,
generated_at: str,
) -> str:
is_fail = state == "failed"
color = "#f85149" if is_fail else "#3fb950"
bg = "#4b1113" if is_fail else "#1a4731"
icon = "&#x2715;" if is_fail else "&#x2713;"
error_section = ""
if error_text:
error_section = f"""
<div style="background:#4b1113;border:1px solid #f85149;border-radius:6px;
padding:12px 16px;margin-top:16px;color:#f85149;font-size:13px">
<strong>Error:</strong> {error_text}
</div>"""
return f"""<!DOCTYPE html>
<html lang="en">
<head><meta charset="UTF-8"><title>Burn-In {state.title()} Alert</title></head>
<body style="margin:0;padding:0;background:#0d1117;font-family:-apple-system,sans-serif;
font-size:14px;color:#c9d1d9">
<table width="100%" cellpadding="0" cellspacing="0">
<tr><td align="center" style="padding:32px 16px">
<table width="480" cellpadding="0" cellspacing="0" style="max-width:480px;width:100%">
<tr>
<td style="background:{bg};border:2px solid {color};border-radius:10px;padding:24px">
<div style="font-size:26px;font-weight:700;color:{color};margin-bottom:16px">
{icon} Burn-In {state.upper()}
</div>
<table cellpadding="0" cellspacing="0" style="width:100%">
<tr>
<td style="color:#8b949e;font-size:12px;padding:5px 0">Device</td>
<td style="font-weight:600;text-align:right;font-size:15px">{devname}</td>
</tr>
<tr>
<td style="color:#8b949e;font-size:12px;padding:5px 0">Model</td>
<td style="text-align:right">{model or ''}</td>
</tr>
<tr>
<td style="color:#8b949e;font-size:12px;padding:5px 0">Serial</td>
<td style="font-family:monospace;text-align:right">{serial or ''}</td>
</tr>
<tr>
<td style="color:#8b949e;font-size:12px;padding:5px 0">Job #</td>
<td style="font-family:monospace;text-align:right">{job_id}</td>
</tr>
</table>
{error_section}
<div style="margin-top:16px;font-size:11px;color:#8b949e">{generated_at}</div>
</td>
</tr>
</table>
</td></tr>
</table>
</body>
</html>"""
async def send_job_alert(
job_id: int,
devname: str,
serial: str | None,
model: str | None,
state: str,
error_text: str | None,
) -> None:
"""Send an immediate per-job alert email (pass or fail)."""
icon = "" if state == "failed" else ""
subject = f"{icon} Burn-In {state.upper()}: {devname} ({serial or 'no serial'})"
now_str = datetime.now().strftime("%Y-%m-%d %H:%M")
html = _build_alert_html(job_id, devname, serial, model, state, error_text, now_str)
await asyncio.to_thread(_send_email, subject, html)
async def test_smtp_connection() -> dict:
"""
Try to establish an SMTP connection using current settings.
Returns {"ok": True/False, "error": str|None}.
Does NOT send any email.
"""
if not settings.smtp_host:
return {"ok": False, "error": "SMTP_HOST is not configured"}
def _test() -> dict:
try:
ctx = ssl.create_default_context()
mode = (settings.smtp_ssl_mode or "starttls").lower()
timeout = int(settings.smtp_timeout or 60)
port = _smtp_port()
if mode == "ssl":
server = smtplib.SMTP_SSL(settings.smtp_host, port,
context=ctx, timeout=timeout)
server.ehlo()
else:
server = smtplib.SMTP(settings.smtp_host, port, timeout=timeout)
server.ehlo()
if mode == "starttls":
server.starttls(context=ctx)
server.ehlo()
if settings.smtp_user:
server.login(settings.smtp_user, settings.smtp_password)
server.quit()
return {"ok": True, "error": None}
except Exception as exc:
return {"ok": False, "error": str(exc)}
return await asyncio.to_thread(_test)
async def send_report_now() -> None:
"""Send a report immediately (used by on-demand API endpoint)."""
drives = await _fetch_report_data()
unlock_events = await _fetch_unlock_events_24h()
now_str = datetime.now().strftime("%Y-%m-%d %H:%M")
html = _build_html(drives, now_str, unlock_events)
suffix = ""
if unlock_events:
suffix = f"{len(unlock_events)} pool unlock(s)"
subject = (
f"Burn-In Report — {datetime.now().strftime('%Y-%m-%d')} "
f"({len(drives)} drives){suffix}"
)
await asyncio.to_thread(_send_email, subject, html)
async def run() -> None:
"""Background loop: send daily report at smtp_report_hour local time."""
if not settings.smtp_host:
log.info("SMTP not configured — daily email disabled")
return
log.info(
"Mailer started — daily report at %02d:00 local time",
settings.smtp_report_hour,
)
while True:
now = datetime.now()
target = now.replace(
hour=settings.smtp_report_hour,
minute=0, second=0, microsecond=0,
)
if target <= now:
target += timedelta(days=1)
wait = (target - now).total_seconds()
log.info("Next report in %.0f seconds (%s)", wait, target.strftime("%Y-%m-%d %H:%M"))
await asyncio.sleep(wait)
if settings.smtp_daily_report_enabled:
try:
await send_report_now()
except Exception as exc:
log.error("Failed to send daily report: %s", exc)
else:
log.info("Daily report skipped — smtp_daily_report_enabled is False")
# Sleep briefly past the hour to avoid drift from re-triggering immediately
await asyncio.sleep(60)