Description
What
Although Coil implements a highly available egress NAT, connection
tracking state is lost when one of the egress NAT Pods goes away.
Linux tracks connection state in netfilter's conntrack tables, which can be read and edited
via netlink. There is even a program called conntrackd that exports and
synchronizes conntrack data between two servers.
With this capability, Coil can keep egress NAT connections alive across Pod restarts.
How
To switch all connections from one NAT Pod to another, Coil has to do two things.
- The new Pod should take over the global IP address of the old Pod.
- Coil should stop advertising the global IP on the node of the old Pod and start it on the node of the new Pod.
This means that Coil should not assign the global IP address as the Pod's own address.
Instead, Coil should assign a normal cluster-internal IP address to NAT Pods
and give them extra global IP addresses for NAT use. Those global IP addresses
float between NAT Pods, so we can call them floating addresses.
Below is a summary of the necessary changes.
A detailed design doc is still needed.
- Define a pool of floating addresses for egress NAT.
- Assign floating addresses to egress NAT Pods and program routing.
- Reprogram routing when the owner of a floating address changes.
  - One idea is to change the Service endpoints.
  - Another idea is to get rid of Service for egress Pods and program routing in each client Pod.
- Advertise each floating address only from the node of its current owner Pod.
- Implement fast health-checking to detect failed Pods.
  - VRRP and BFD are commonly used, but we can use any protocol.
- Synchronize the conntrack state between egress NAT Pods.
  - We may use conntrackd or do it ourselves in Go with https://github.com/vishvananda/netlink/blob/v1.3.0/conntrack_linux.go
Checklist
- Finish implementation of the issue
- Test all functions
- Have enough logs to trace activities
- Notify developers of necessary actions