Nodes lose WireGuard after they are rebooted and uncordoned #31
No description provided.
- `ufw disable` does not fix the issue
- `wg-quick down wg0 && wg-quick up wg0` does not fix the issue
- logs of cilium-agent at the time of the loss of communication; the communication via WireGuard was lost at about 01:24:06 in this log
Logs of kube-proxy do not show any entries at the time of the crash; only the loss of communication is logged about 38 s later.
The ufw logs are interesting... there is an entry about Cilium getting blocked.
kernel log
daemon.log
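For anyone reproducing this: a quick way to confirm that the tunnel itself is dead, rather than just Kubernetes routing on top of it, is to check WireGuard's handshake and transfer counters. A minimal sketch, assuming the interface is named `wg0` as above:

```sh
# Last successful handshake per peer; a timestamp frozen since the reboot
# means the tunnel itself is down, not just pod routing.
wg show wg0 latest-handshakes

# Per-peer byte counters; RX stuck at the same value means no return traffic.
wg show wg0 transfer
```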
Update: I used my test cluster to debug the problem a bit. I guess the problem is caused by Cilium's iptables rules. This only occurs when IPv6 is enabled; with an IPv4-only configuration the problem does not exist.
Makes sense, since WireGuard communicates over the main interface (public IPv6), and the ip6tables rules are only created when IPv6 is also enabled in Cilium. 🚧
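To verify that theory, one could dump the IPv6 chains Cilium installs and watch the rule counters while the tunnel is broken. A sketch, assuming Cilium's usual `CILIUM_`-prefixed chain names and WireGuard's default port 51820/udp (the actual port here isn't stated in the thread):

```sh
# List the chains Cilium installs in the IPv6 filter table
# (CILIUM_ is Cilium's chain-name prefix; adjust if yours differ).
ip6tables -S | grep CILIUM

# Watch packet/byte counters on INPUT to spot which rule is eating the
# incoming WireGuard traffic (51820/udp by default, an assumption here).
watch -n1 'ip6tables -L INPUT -v -n'
```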
Unfortunately, the node needs to be reinstalled to solve the problem. Redeploying the CNI and deleting the iptables chains does not solve the problem. 🙈💩
I have rechecked Cilium on the test cluster. The attempt to run "Cilium only on the WireGuard interface" did not solve the problem. Maybe Cilium's host policies are the solution to allow the WireGuard port. 🤔🤞
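If anyone wants to try that route: with Cilium's host firewall enabled, a clusterwide host policy could allow the WireGuard port on every node. A rough, untested sketch; the policy name, the empty `nodeSelector` (matching all nodes) and port 51820/udp are assumptions:

```sh
# Sketch of a Cilium host policy allowing WireGuard traffic between nodes.
# Requires Cilium's host firewall feature to be enabled.
kubectl apply -f - <<'EOF'
apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: allow-wireguard        # hypothetical name
spec:
  nodeSelector: {}             # assumption: apply to every node
  ingress:
    - fromEntities:
        - remote-node          # traffic from the other cluster nodes
      toPorts:
        - ports:
            - port: "51820"    # assumption: WireGuard default port
              protocol: UDP
EOF
```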
A weird workaround would be to drop Cilium and switch back to Flannel after all... 🤔
With Flannel it works fine.