dcw3: dcc1 container veth detaches from br0 across reboot/snap-lxd-restart, kills LAN egress #4
Symptom
After dcw3 host reboot or snap restart lxd, the dcc1 container loses LAN egress: cannot reach 192.168.88.1, no DNS, no apt/pacman. dcc1's eth0 is UP/LOWER_UP with the right IP and MAC, but ARP for the gateway gets no replies. vpntest (the other LXD container on the same host) is unaffected.
Root cause
dcc1's host-side veth ends up detached from br0, while vpntest's is correctly master br0.
This was caught 2026-04-29 while trying to wire dcc1 onto the marfrit pacman repo (DNS resolution failed; debugging led to bridge inspection).
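A quick way to spot a detached veth is to filter the one-line output of ip link for veth interfaces whose master is not the bridge. The helper below is a minimal sketch, not the command used during the original debugging: the function name orphan_veths is hypothetical, and it assumes the iproute2 ip -o link format, in which an enslaved interface carries a "master <bridge>" token.

```shell
#!/bin/sh
# Sketch (hypothetical helper): read `ip -o link show` output on stdin and
# print every veth* interface that is NOT enslaved to the given bridge.
# Assumption: enslaved interfaces show "master <bridge>" in the -o output.
orphan_veths() {
    bridge="${1:-br0}"
    awk -v br="$bridge" '
        {
            name = $2
            sub(/:$/, "", name)     # drop trailing colon
            sub(/@.*/, "", name)    # drop "@ifN" peer suffix
            master = ""
            for (i = 3; i < NF; i++)
                if ($i == "master") master = $(i + 1)
            if (name ~ /^veth/ && master != br) print name
        }'
}

# Usage on the host:
#   ip -o link show type veth | orphan_veths br0
```

An empty result would mean every host-side veth is on the bridge; any name printed is a candidate for the symptom described above.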
One-line live fix
…then incus exec dcc1 -- ip neigh flush all and the container has full LAN connectivity again.
Reproducer (suspected)
Either of:
- Reboot dcw3.
- sudo snap restart lxd on dcw3.
The veth is created by the snap-managed LXD daemon during container start, but the bridge br0 is owned by NetworkManager. The likely cause is a race or ownership conflict on attach.
Environment
- Kernel: 6.12.47+rpt-rpi-v8
- LXD: 5.21.4-aee7e08 via snap (channel 5.21/stable)
- systemd: 257 (257.9-1~deb13u1)
- Bridge: br0 (NM connection profile)
- LXD profile default uses nictype: bridged, parent: br0
Why dcc1 specifically
Both dcc1 and vpntest use the same default profile pointing at br0. vpntest's veth attaches correctly; dcc1's doesn't. The difference is likely start order: whichever container has its veth created while NM is still settling on br0 loses. dcc1 is the heavier container (Arch + distcc-avahi) and starts second; vpntest is leaner and starts first. The result: dcc1's veth is created while NM is still reconfiguring br0, and the LXD-side master br0 setting silently fails (no error in lxc info dcc1 --show-log).
Candidate fixes
1. Drop NetworkManager management of br0, define it via systemd-networkd or plain /etc/network/interfaces (deterministic ordering, no async settle).
2. A systemd unit that runs after snap.lxd.daemon.service and reattaches any LXD veth not in br0. Crude but bulletproof.
3. boot.host_shutdown_timeout / raw.lxc hooks that explicitly attach veth post-start.
(1) is the proper fix. (2) is the safety net we can ship now. (3) is fragile.
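The reattach step for option (2) could be sketched along these lines. This is an illustration under assumptions, not the exact script from this issue: the function name rebridge_veths is hypothetical, it treats only host veths with no master at all as orphans (so veths owned by another bridge are left alone), and it needs root, for example when called from a oneshot unit.

```shell
#!/bin/sh
# Sketch (hypothetical helper): enslave every host veth that has no master
# to the given bridge. Assumption: any masterless veth belongs to LXD.
rebridge_veths() {
    bridge="${1:-br0}"
    ip -o link show type veth |
    while read -r _idx name rest; do
        name="${name%:}"       # drop trailing colon
        name="${name%%@*}"     # drop "@ifN" peer suffix
        case " $rest " in
            *" master "*) continue ;;   # already enslaved somewhere
        esac
        ip link set dev "$name" master "$bridge" && echo "reattached $name"
    done
}

# Usage as root:
#   rebridge_veths br0
```

Skipping interfaces that already have a master keeps the helper idempotent, so running it on every daemon restart should be harmless when the race did not trigger.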
Workaround documentation
Not yet in agents/his.md/skills/his/SKILL.md runbook. Should land alongside whichever fix path we pick.
Priority
Medium. Bites every time dcw3 reboots; user noticed during a marfrit-repo wiring attempt.
Safety-net (option 2) shipped on dcw3 2026-05-18
Installed /etc/systemd/system/lxd-veth-rebridge.service: a oneshot unit with After=snap.lxd.daemon.service and ExecStartPre=sleep 5, whose ExecStart re-attaches any orphan veth* interface to br0.
systemctl enable --now lxd-veth-rebridge.service succeeded; the unit fires after snap.lxd.daemon.service settles.
Live test
Started dcc1 — veth landed on br0 automatically (the race didn't trigger this session), dcc1 pings 192.168.88.1 in 0.5ms. Pre-existing dcc1 connectivity unaffected. Will get its real workout on the next dcw3 reboot.
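For the next reboot, a check along these lines could confirm egress from both containers in one go. It is a sketch only: check_egress is a hypothetical helper, and it assumes the incus client can reach both containers and that an iputils-style ping (supporting -c and -W) is installed inside each of them.

```shell
#!/bin/sh
# Sketch (hypothetical helper): ping the gateway once from inside each
# container and report the result. Returns non-zero if any container fails.
# Assumption: `ping -c 1 -W 2` is available inside every container.
check_egress() {
    gateway="$1"; shift
    failed=0
    for c in "$@"; do
        if incus exec "$c" -- ping -c 1 -W 2 "$gateway" >/dev/null 2>&1; then
            echo "$c: ok"
        else
            echo "$c: NO EGRESS"
            failed=1
        fi
    done
    return "$failed"
}

# Usage on dcw3:
#   check_egress 192.168.88.1 dcc1 vpntest
```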
Runbook updated
Appended to /opt/his-context/agent.md with the dcw3 veth/br0 entry + the one-line live fix for mid-life incidents.
Closing as fixed via safety-net
Proper fix (option 1: drop NetworkManager management of br0, switch to systemd-networkd or /etc/network/interfaces for deterministic ordering) deferred. Will open as a separate ticket if/when the safety-net proves insufficient.