Description
This issue is based solely on observations and code study. As with any non-deterministic behavior, it can be difficult to pinpoint and correct, so more eyes and validation would be useful.
PBnJ operates on requests asynchronously. When a consumer sends multiple requests one after another, PBnJ accepts them but has no context relating them to each other. Because requests are serviced asynchronously, it's left to the Go runtime to schedule the goroutines, and that scheduling is non-deterministic, so the resulting behavior is too. For example, sending a boot device request followed by a power on request can result in machines starting, but not with the expected boot device.
I don't think this is a bug; I think it's a design flaw. PBnJ would benefit from a deterministic API of some sort, removing the need for consumers to synchronize their requests.
Possible Solution
(1) Introduce synchronous APIs. This would ensure RPCs don't return until the action has actually been carried out, greatly improving the consumer experience.
(2) Offer an API that allows specifying multiple actions per request. PBnJ could still handle each request asynchronously, but the request would carry the context needed to perform its actions serially.
(3) The most complicated. PBnJ could manage a queue per BMC. When a request is received and no queue is present for the endpoint, create a new one and hold it in memory in an LRU cache, possibly with timeouts too (just to control the memory footprint more intentionally). Add the task to the queue and have a job management construct with N workers plucking from the various queues. A worker can't start the next item in a queue until the previous item for that queue has finished, but it can pluck from other BMC queues. This hinges on some sort of static data for BMCs being available in a request so you can look up the queue. I suspect the IP address would be static enough, given it makes little sense for BMC addresses to be dynamic (maybe people use a fancy dynamic DNS setup?). This is akin to a job management system.
Steps to Reproduce (for bugs)
It's a race condition, so it's hard to reproduce. Run boot device and power on requests in that order enough times and you'll probably see it.
Context
On EKS-A we observed, when beginning the provisioning process, that machines booted from disk despite being asked to PXE boot. The BMC capability offered by EKS-A ensures we can provision machines even if they already have an OS image on disk.