Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Controller crashes if ENABLE_NP is false and not specifying a gateway node #4841

Open
SkalaNetworks opened this issue Dec 17, 2024 · 6 comments
Labels
bug Something isn't working need backport subnet

Comments

@SkalaNetworks
Copy link
Contributor

Kube-OVN Version

v1.13/v1.14

Kubernetes Version

v1.31.0

Operation-system/Kernel Version

/

Description

The controller crashes when submitting a new Subnet to the K8S API through the Go SDK if network policies have been disabled.

The reason is a panic due to a nil pointer found here:

policies, err := c.npsLister.List(labels.Everything())

We try to use a npLister, which isn't initialized due to network policies being disabled. The only way to not trigger that call is for the above condition to evaluate to false. My example code doesn't trigger that code if I set the gateway manually, but if I don't specify the gateway, it seems Kube-OVN accepts the Subnet as-is, and a millisecond later edits the Gateway to the first IP in the subnet.

This means there's 2 events in quick succession:

  • Creation (triggered by my code)
  • Modification to set the gateway to the first IP (triggered by Kube-OVN)

And then it crashes. I don't understand how my call from Go code triggers the modification of the gateway, but doing it using kubectl doesn't. If I don't specify the gateway in a YAML file and submit it, no crash happens.

I am unsure if my way of creating Subnets using the SDK is the right one considering this bug, but it's the standard K8S way.

I also don't know why NPs need to be re-enqued if the gateway changes, but adding a condition that fails to check if npLister is nil would be a quick fix for this problem.

Steps To Reproduce

  1. Deploy latest Kube-OVN build
  2. Write go code that creates a Subnet using the Go SDK:
	subnet := v1.Subnet{
		ObjectMeta: metav1.ObjectMeta{Name: "this-crashes-controller"},
		Spec: v1.SubnetSpec{
			Vpc:        "dummy",
			Default:    true,
			CIDRBlock:   "10.60.0.0/24",
			NatOutgoing: false,
			//Gateway:     "10.60.0.1", // We are not specifying a gateway, if we remove the comment, it doesn't crash anymore
			GatewayType: "", 
			Protocol:    "IPv4",
		},
	}

	return kubernetes.KubeOvnClient.KubeovnV1().Subnets().Create(context.Background(), &subnet, metav1.CreateOptions{})
  1. Witness the controller crashing when that code is called

Current Behavior

Controller crashes

Expected Behavior

Controller doesn't crash

@SkalaNetworks SkalaNetworks added the bug Something isn't working label Dec 17, 2024
@dosubot dosubot bot added the subnet label Dec 17, 2024
Copy link

dosubot bot commented Dec 17, 2024

The issue you're encountering is due to the npLister not being initialized when ENABLE_NP is set to false. This results in a nil pointer dereference when the controller attempts to access npLister after a subnet creation event triggers a modification to set the gateway to the first IP in the subnet [1].

When you create a subnet using the Go SDK without specifying a gateway, Kube-OVN automatically modifies the subnet to set the gateway to the first IP in the CIDR block. This triggers two rapid events: the initial creation and a subsequent modification, which leads to the crash because the npLister is accessed without being initialized.

In contrast, when using kubectl, the modification sequence does not trigger the same crash, possibly due to differences in how events are processed or queued in the two methods.

To address this issue, you can implement a nil check for npLister before its usage in the code. This would prevent the crash by ensuring that the code does not attempt to access npLister when it is not initialized. This is a quick fix that can mitigate the problem until a more comprehensive solution is implemented in the codebase.

To continue talking to Dosu, mention @dosu.


Help Dosu learn! Give it feedback: Great Response | Irrelevant Answer | Incorrect Sources | Too Verbose | Hallucination | Other

@oilbeater
Copy link
Collaborator

When kube-ovn-cni initializes the Pod network, it pings the gateway address to ensure the network is ready. If a NetworkPolicy is present, it must allow traffic to the gateway IP to ensure the ping succeeds. That's why the logical NetworkPolicy list exists.

@SkalaNetworks
Copy link
Contributor Author

@oilbeater do you hve an idea why not specifying a gateway in a YAML file doesn't trigger that crash but through the SDK it does? I can implement a simple fix based on a check for nil pointers, but I also want to understand why no one has noticed this bug before, am I calling the SDK wrong?

@oilbeater
Copy link
Collaborator

oilbeater commented Dec 18, 2024

@SkalaNetworks with this yaml and kubectl, kube-ovn-controller also crash. So I think is a common issue. In theory, when gateway IP changed the crash will happen.

apiVersion: kubeovn.io/v1
kind: Subnet
metadata:
  name: subnet1
spec:
  vpc: ovn-cluster
  default: false
  protocol: IPv4
  cidrBlock: 10.66.0.0/16
  natOutgoing: true

@oilbeater
Copy link
Collaborator

Sometimes the kube-ovn-controller can recover from crash, maybe related to some order of startup and subnet handle.

@SkalaNetworks
Copy link
Contributor Author

Yeah, I can't reproduce the bug very consistently with YAML files, but the SDK fails 100% of the time. This is strange. I pushed a fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working need backport subnet
Projects
None yet
Development

No branches or pull requests

2 participants