dcw3: dcc1 container veth detaches from br0 across reboot/snap-lxd-restart, kills LAN egress #4

Closed
opened 2026-04-29 20:54:24 +00:00 by marfrit · 1 comment
Owner

Symptom

After dcw3 host reboot or snap restart lxd, the dcc1 container loses LAN egress: cannot reach 192.168.88.1, no DNS, no apt/pacman. dcc1's eth0 is UP/LOWER_UP with the right IP and MAC, but ARP for the gateway gets no replies. vpntest (the other LXD container on the same host) is unaffected.

Root cause

dcc1's host-side veth ends up detached from br0, while vpntest's is correctly master br0:

ip link show veth8c828c37
# 7: veth8c828c37@if6: ... state UP mode DEFAULT
#                       ^^^ no `master br0`

ip link show veth5943bb1c
# 9: veth5943bb1c@if8: ... master br0 state UP
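
The same check can be run across every host-side veth at once. This is a minimal sketch, assuming the usual sysfs layout in which an enslaved interface carries a `master` symlink under `/sys/class/net/<if>/`. The function name `veth_masters` and its optional directory argument are made up for illustration and were not part of the original debugging session.

```shell
#!/bin/sh
# Print each veth* interface with the device it is enslaved to ("-" if none).
# The optional argument (a sysfs net directory) only exists to ease testing.
veth_masters() {
    net_dir=${1:-/sys/class/net}
    for dev_path in "$net_dir"/veth*; do
        [ -e "$dev_path" ] || continue        # glob matched nothing
        if [ -L "$dev_path/master" ]; then
            # "master" is a symlink pointing at the bridge's own sysfs entry
            master=$(basename "$(readlink "$dev_path/master")")
        else
            master=-
        fi
        printf '%s %s\n' "${dev_path##*/}" "$master"
    done
}

veth_masters
```

Any interface listed with `-` is not attached to a bridge.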

This was caught 2026-04-29 trying to wire dcc1 onto the marfrit pacman repo (DNS resolution failed, debug led to bridge inspection).

One-line live fix

sudo ip link set veth8c828c37 master br0

…then lxc exec dcc1 -- ip neigh flush all, and the container has full LAN connectivity again.
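
The two steps can be wrapped so that the veth name is looked up rather than copied by hand. This is a sketch only. It assumes the container NIC device is named eth0 and that LXD records the host-side peer in the `volatile.eth0.host_name` key. `reattach_container` is a hypothetical helper, and it uses the `lxc` client from the LXD snap throughout.

```shell
#!/bin/sh
# Re-attach a container's host-side veth to a bridge, then flush stale ARP state.
# Usage: reattach_container <container> [bridge]
reattach_container() {
    ct=$1
    bridge=${2:-br0}
    # Assumption: the NIC device is "eth0"; LXD stores its host-side peer name here.
    veth=$(lxc config get "$ct" volatile.eth0.host_name) || return 1
    [ -n "$veth" ] || { echo "no host veth recorded for $ct" >&2; return 1; }
    ip link set "$veth" master "$bridge" || return 1
    lxc exec "$ct" -- ip neigh flush all
}

# Example (needs root on the host):
#   reattach_container dcc1 br0
```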

Reproducer (suspected)

Either of:

  • Reboot dcw3.
  • sudo snap restart lxd on dcw3.

The veth is created by the snap-managed LXD daemon during container start, but the bridge br0 is owned by NetworkManager. Race / ownership conflict on attach.
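
To confirm the suspected race rather than infer it, link events could be recorded across a restart. This is a rough sketch using `ip monitor link` with short timestamps (`-ts`). The helper names, the log path in the usage comment and the match pattern are illustrative choices, not something taken from the original investigation.

```shell
#!/bin/sh
# Keep only link events that mention the bridge or a veth interface.
# fflush() keeps the log current even if the pipeline is interrupted.
filter_link_events() {
    awk '/br0|veth/ { print; fflush() }'
}

# Record timestamped link events into a file until interrupted.
# Usage: trace_link_events /tmp/link-events.log
#        (then, in another shell: sudo snap restart lxd)
trace_link_events() {
    ip -ts monitor link | filter_link_events >"$1"
}
```

Comparing the timestamps of br0 state changes with the appearance of each veth should show whether the attach really overlaps with NetworkManager activity.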

Environment

  • dcw3: Pi 4, Debian 13 trixie, kernel 6.12.47+rpt-rpi-v8
  • LXD 5.21.4-aee7e08 via snap (channel 5.21/stable)
  • systemd 257 (257.9-1~deb13u1)
  • NetworkManager-managed br0 (NM connection profile)
  • LXD profile default uses nictype: bridged, parent: br0

Why dcc1 specifically

Both dcc1 and vpntest use the same default profile pointing at br0. vpntest's veth attaches correctly; dcc1's doesn't. The difference is likely start order: a veth that LXD creates while NM is still settling br0 can miss its attach. vpntest is the leaner container and starts first; dcc1 is the heavier one (Arch + distcc-avahi) and starts second. dcc1's veth is evidently created while NM is still reconfiguring br0, and LXD's attempt to set master br0 fails silently (no error in lxc info dcc1 --show-log). The exact timing window is unconfirmed.
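
The start-order hypothesis can be checked against the kernel's interface indices. This is a small sketch, assuming that ifindex values are handed out incrementally, so a lower index roughly means "created earlier". That only holds for interfaces created since the same boot, and `veth_creation_order` is a hypothetical helper name.

```shell
#!/bin/sh
# List veth* interfaces in ascending ifindex order, i.e. roughly creation order.
# Assumption: ifindex values are allocated incrementally since boot.
veth_creation_order() {
    net_dir=${1:-/sys/class/net}
    for dev_path in "$net_dir"/veth*; do
        [ -r "$dev_path/ifindex" ] || continue
        printf '%s %s\n' "$(cat "$dev_path/ifindex")" "${dev_path##*/}"
    done | sort -n
}

veth_creation_order
```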

Candidate fixes

  1. Drop NetworkManager management of br0, define it via systemd-networkd or plain /etc/network/interfaces (deterministic ordering, no async settle).
  2. systemd unit on dcw3 that runs after snap.lxd.daemon.service and reattaches any LXD veth not in br0. Crude but bulletproof:
    [Unit]
    After=snap.lxd.daemon.service
    [Service]
    Type=oneshot
    ExecStart=/bin/sh -c 'for v in $(ls /sys/class/net | grep ^veth); do bridge link show dev "$v" | grep -q "master br0" || ip link set "$v" master br0; done'
    
  3. LXD config workaround — set per-container boot.host_shutdown_timeout / raw.lxc hooks that explicitly attach veth post-start.

(1) is the proper fix. (2) is the safety net we can ship now. (3) is fragile.
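
For option (2), the loop could also live in a standalone script instead of an inline `sh -c` string. The sketch below is one possible shape, not the unit as proposed. It reads the sysfs `master` symlink instead of parsing `bridge link` output, and `rebridge_orphans` is a hypothetical name. Unlike the inline loop, it leaves alone any veth that is already enslaved to some other bridge.

```shell
#!/bin/sh
# Attach every veth* interface that currently has no master to the given bridge.
# Interfaces already enslaved (to br0 or anything else) are left untouched.
# Usage: rebridge_orphans [bridge] [sysfs-net-dir]
rebridge_orphans() {
    bridge=${1:-br0}
    net_dir=${2:-/sys/class/net}
    for dev_path in "$net_dir"/veth*; do
        [ -e "$dev_path" ] || continue           # glob matched nothing
        [ -L "$dev_path/master" ] && continue    # already attached somewhere
        veth=${dev_path##*/}
        echo "attaching $veth to $bridge"
        ip link set "$veth" master "$bridge" || echo "failed to attach $veth" >&2
    done
}

# Example (needs root on the host):
#   rebridge_orphans br0
```

Skipping interfaces that already have a master is a deliberate choice here. It avoids pulling a veth off an unrelated bridge, at the cost of not correcting one that is attached to the wrong bridge.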

Workaround documentation

Not yet in agents/his.md / skills/his/SKILL.md runbook. Should land alongside whichever fix path we pick.

Priority

Medium. Bites every time dcw3 reboots; user noticed during a marfrit-repo wiring attempt.

Author
Owner

Safety-net (option 2) shipped on dcw3 2026-05-18

Installed /etc/systemd/system/lxd-veth-rebridge.service — oneshot After=snap.lxd.daemon.service, ExecStartPre=sleep 5, ExecStart re-attaches any orphan veth* interface to br0:

for v in $(ls /sys/class/net 2>/dev/null | grep ^veth); do
  bridge link show dev "$v" 2>/dev/null | grep -q "master br0" || ip link set "$v" master br0
done

systemctl enable --now lxd-veth-rebridge.service succeeded; the unit fires after snap.lxd.daemon.service settles.
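
A single pass after a fixed `sleep 5` only catches containers that have already started by then. One possible refinement, sketched here under that assumption and not part of the installed unit, is to repeat the pass a few times. `repeat_pass` is a hypothetical helper, and the script path in the usage comment is likewise made up.

```shell
#!/bin/sh
# Run a command several times with a pause between runs.
# Usage: repeat_pass <passes> <interval-seconds> <command> [args...]
repeat_pass() {
    passes=$1
    interval=$2
    shift 2
    while [ "$passes" -gt 0 ]; do
        "$@"
        passes=$((passes - 1))
        # no need to sleep after the final pass
        [ "$passes" -gt 0 ] && sleep "$interval"
    done
    return 0
}

# Example: six passes, five seconds apart (hypothetical script path):
#   repeat_pass 6 5 /usr/local/sbin/lxd-veth-rebridge
```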

Live test

Started dcc1 — veth landed on br0 automatically (the race didn't trigger this session), dcc1 pings 192.168.88.1 in 0.5ms. Pre-existing dcc1 connectivity unaffected. Will get its real workout on the next dcw3 reboot.

Runbook updated

Appended to /opt/his-context/agent.md with the dcw3 veth/br0 entry + the one-line live fix for mid-life incidents.

Closing as fixed via safety-net

Proper fix (option 1 — drop NetworkManager management of br0, switch to systemd-networkd or /etc/network/interfaces for deterministic ordering) deferred. Will open as a separate ticket if/when the safety-net proves insufficient.


Reference: marfrit/claude-his-agent#4