fahrenheit/pihole-FTL: silent dashboard outage when restart hits port-bind conflict — needs watchdog #2

Closed
opened 2026-04-29 04:42:00 +00:00 by marfrit · 1 comment
Owner

What happened

2026-04-29 ~04:25 UTC: the Pi-hole admin UI was unreachable. DNS was still working, and no alerts had fired.

Root cause

pihole-FTL, running since 2026-04-23, went through a respawn cycle on 2026-04-24 09:28 UTC. When the new instance tried to bind, ports 80/443 were briefly held by something else (likely the previous FTL instance; see the sibling bug on supervise-daemon orphan reaping). civetweb logged:

[2026-04-24 09:28:21.880 UTC 496] cannot bind to 80o: 98 (Address in use)
[2026-04-24 09:28:21.880 UTC 496] cannot bind to 443os: 98 (Address in use)

The o/os modifiers in webserver.port = "80o,443os,[::]:80o,[::]:443os" mark each bind as optional: when a bind fails, civetweb logs the error and carries on without that listener. There is no retry. FTL kept running in DNS-only mode for ~5 days before anyone noticed.

The failure is also invisible from a casual look:

  • Main /var/log/pihole/FTL.log has no mention of it.
  • pihole status shows ✓ for FTL listening on 53.
  • webserver.log is the only place it surfaces, and that file isn't on any monitoring path.
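Since webserver.log is the only place the failure shows up, one way to put it on a monitoring path is to count the bind-error lines in it. The following is a minimal sketch, not something that exists in the repo: the function name bind_error_count is hypothetical, and the log path is passed in as an argument because the issue does not state where webserver.log lives.

```shell
#!/bin/sh
# Hypothetical helper: print how many civetweb bind failures a log file contains.
# The match string is taken from the log lines quoted in this issue.
bind_error_count() {
    # -F: fixed string, -c: print the number of matching lines (0 if none)
    grep -cF 'cannot bind to' "$1"
}
```

Called as bind_error_count /path/to/webserver.log, any result above zero means the embedded web server failed to bind at least once since the log was last rotated. Because the evidence here was found in webserver.log.1, a real check would likely need to look at rotated files too.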

Suggested watchdog

Minimal cron on hertz (or inside the container) that pokes the embedded server and bounces FTL on failure:

*/5 * * * * root /snap/bin/lxc exec fahrenheit -- sh -c 'curl -sfo /dev/null --max-time 3 http://127.0.0.1/admin/login || rc-service pihole-FTL restart'

Alternatively, fold into health-check.sh on hertz alongside the existing container/cert/disk/thermal/battery checks — it's the same shape of "poke endpoint, alert/repair on failure".
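If the check is folded into health-check.sh, the cron one-liner above could be split into a probe, a repair, and a small step function that reports what it did so the caller can alert. This is only a sketch under assumptions: the function names and the ADMIN_URL variable are hypothetical, the actual structure of health-check.sh is not shown in this issue, and the code is meant to run inside the container (or be wrapped in lxc exec fahrenheit from hertz).

```shell
#!/bin/sh
# Hypothetical helpers; names are illustrative, not taken from health-check.sh.

# Succeeds only when the embedded web server answers without an HTTP error.
# Same curl flags as the cron one-liner; ADMIN_URL is an assumed override.
probe_admin_ui() {
    curl -sf -o /dev/null --max-time 3 "${ADMIN_URL:-http://127.0.0.1/admin/login}"
}

# Restart FTL so the new instance retries the bind on 80/443.
restart_ftl() {
    rc-service pihole-FTL restart
}

# Run the probe ($1); if it fails, run the repair ($2).
# Prints "ok", "restarted", or "restart-failed"; returns non-zero only
# when the repair command itself fails.
watchdog_step() {
    if "$1"; then
        echo ok
    elif "$2"; then
        echo restarted
    else
        echo restart-failed
        return 1
    fi
}
```

A call would look like watchdog_step probe_admin_ui restart_ftl. Passing the probe and repair as arguments keeps the step testable without touching the network or the service, and "restarted" only means the restart command succeeded, not that the UI is back; a second probe after a short delay would be needed to confirm that.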

Lesser fix (config-only)

Drop the o/os modifiers from webserver.port so a bind failure is loud (civetweb refuses to start, supervise-daemon respawns). Risk: if the conflict is transient and self-healing, FTL would respawn-loop. The watchdog is more robust.
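To make the config change concrete, the transformation amounts to removing the o flag that directly follows each port number while keeping the s (TLS) flag. A minimal sketch, assuming the modifiers always appear as o or os right after the digits as they do in the value quoted above; strip_optional is a hypothetical name, and other modifier combinations are not handled.

```shell
#!/bin/sh
# Hypothetical helper: drop the "optional" flag from a webserver.port value.
# Only handles "<port>o" and "<port>os", the two forms seen in this issue.
strip_optional() {
    printf '%s\n' "$1" | sed -E 's/([0-9]+)o(s?)/\1\2/g'
}
```

For the current value, strip_optional '80o,443os,[::]:80o,[::]:443os' prints 80,443s,[::]:80,[::]:443s, which is the string that would then be written to webserver.port.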

Cleanup that worked

  • kill -9 <orphan-pid> (SIGTERM was ignored — see sibling bug)
  • rc-service pihole-FTL restart (to make the new instance retry the now-free 80/443)
  • Verified: single PID binding 53/80/443 in both v4 and v6, /admin/ returns 302.
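The verification step can be approximated by parsing ss -tln output and listing which of the expected ports have no listener. This is a rough sketch, not the check that was actually run: missing_ports is a hypothetical name, it assumes the usual ss column layout (state in column 1, local address:port in column 4), and it does not distinguish v4 from v6 or confirm that a single PID owns all sockets.

```shell
#!/bin/sh
# Hypothetical helper: read `ss -tln` output on stdin and print the expected
# ports (53, 80, 443) that have no LISTEN socket, space-separated.
missing_ports() {
    awk -v want="53 80 443" '
        # Port is the last colon-separated piece of the local address,
        # which works for both 0.0.0.0:53 and [::]:53.
        $1 == "LISTEN" { n = split($4, a, ":"); seen[a[n]] = 1 }
        END {
            k = split(want, w, " ")
            out = ""
            for (i = 1; i <= k; i++)
                if (!(w[i] in seen)) out = out (out == "" ? "" : " ") w[i]
            print out
        }'
}
```

Run as ss -tln | missing_ports. Mid-incident, with only port 53 listening, it would print 80 443; after the cleanup it should print an empty line.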

Evidence trail

  • webserver.log.1 2026-04-24 09:28:21.880 UTC: bind errors
  • FTL.log 2026-04-23 .. 2026-04-29: no web-related entries
  • ss -tlnp mid-incident: only pihole-FTL on 53, nothing on 80/443
  • Resolution time: ~5 minutes once diagnosed; dashboard had been down ~5 days.
Author
Owner

Shipped the lesser fix in v0.1.9. fahrenheit pihole.toml updated live: pihole-FTL --config webserver.port "80,443s,[::]:80,[::]:443s" (verified: the running instance keeps the old config in memory until the next FTL restart, at which point the change takes effect). Next time a port collision happens, civetweb will refuse to start and supervise-daemon will respawn FTL instead of it giving up silently. Watchdog cron rejected as unneeded with the loud-fail config; the runbook ("fahrenheit / pihole-FTL gotchas" subsection in v0.1.9) documents the gotcha so the config does not drift back. Closing.


Reference: marfrit/claude-his-agent#2