Description
Is your feature request related to a problem? Please describe.
We currently run the Erlang SDK at scale in our services. Recently, our team started to hit what we believe is an issue caused by the increased number of feature flag evaluations we perform at runtime. We noticed small crashes in the SDK when some of its processes are too busy, either working through the messages in their mailbox or communicating with the LD APIs.
Most of the problems start with this crash:

What we observe (using the module names for easier reference) is that `ldclient_event_server` attempts a synchronous call to `ldclient_event_process_server` while that process is busy. Our guess is that it is sending a batch of events to the LD server while a large backlog of messages sits in its mailbox. The call times out, which crashes the `ldclient_event_server` process. So what is the main problem here?
`ldclient_event_server` is a singleton process, which means that if it crashes, every other process attempting to send messages to it will also crash, because the process no longer exists.
This is the root cause of the problem we are trying to solve. We would like the SDK to handle these situations more gracefully, and we also have a few further suggestions.
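To illustrate what handling the timeout "more gracefully" could look like, here is a minimal sketch of a wrapper around a synchronous call. The module name, the request term `get_last_server_time`, and the fallback value `0` are assumptions made for illustration; they are not taken from the SDK source.

```erlang
-module(safe_call_sketch).
-export([get_last_server_time/1, get_last_server_time/2]).

%% Hypothetical wrapper, not part of the SDK: asks a gen_server for the last
%% known server time and falls back to 0 instead of crashing the caller.
get_last_server_time(Server) ->
    get_last_server_time(Server, 5000).

get_last_server_time(Server, TimeoutMs) ->
    try
        %% The request term is an assumed message shape.
        gen_server:call(Server, get_last_server_time, TimeoutMs)
    catch
        exit:{timeout, _} -> 0;   % callee too busy to reply in time
        exit:{noproc, _}  -> 0    % callee not running (for example, restarting)
    end.
```

With this shape, a busy or missing callee degrades to a default value rather than taking the caller down. One caveat: on OTP releases before 24, a reply that arrives after the timeout can still land in the caller's mailbox, so a gen_server caller would need to tolerate stray messages in `handle_info`.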
Describe the solution you'd like
- Make `ldclient_event_server` handle the gen_server timeouts more gracefully, OR make sure that `ldclient_event_process_server:get_last_server_time` is implemented so that it is concurrently accessible, without the need to exchange messages between processes. If this is not possible for some reason, maybe create an abstraction on top of it so that we can choose the fidelity of the messages.
- Make `ldclient_event_process_server` and `ldclient_event_server` pools of processes. Currently each is a single process, and both are responsible for providing and processing data when evaluating all feature flags. In a highly concurrent application, these processes are certainly bottlenecks, especially if one of them can get blocked while synchronizing data with the server.
- If `send_events` is disabled, make sure that calling `ldclient.variation` does not exchange messages with any process it does not need to; this only adds latency to the flag evaluation and reduces concurrency.
  - To be more explicit, when the function is called, a message is sent to `ldclient_event_server`, and that message is still processed by that gen_server.
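As a rough sketch of the first and third suggestions, the last server time could live in an ETS table that any process reads directly, and event delivery could be skipped entirely when `send_events` is off. Everything here is hypothetical: the module, the table name, the options map, and the `{add_event, Event}` message are illustrative assumptions, not the SDK's actual internals.

```erlang
-module(eval_fast_path_sketch).
-export([init/0, put_last_server_time/1, get_last_server_time/0,
         maybe_send_event/3]).

-define(TABLE, eval_fast_path_sketch).

%% Create the table once from a long-lived owner process
%% (an ETS table is deleted when its owner exits).
init() ->
    ets:new(?TABLE, [named_table, public, set, {read_concurrency, true}]),
    true = ets:insert(?TABLE, {last_server_time, 0}),
    ok.

%% Called by whichever process talks to the server after each response.
put_last_server_time(TimeMs) when is_integer(TimeMs) ->
    true = ets:insert(?TABLE, {last_server_time, TimeMs}),
    ok.

%% Readable from any process without sending a message.
get_last_server_time() ->
    case ets:lookup(?TABLE, last_server_time) of
        [{last_server_time, TimeMs}] -> TimeMs;
        [] -> 0
    end.

%% Skip the event process entirely when events are disabled.
%% The options map and the message shape are assumptions.
maybe_send_event(#{send_events := false}, _Server, _Event) ->
    ok;
maybe_send_event(_Options, Server, Event) ->
    gen_server:cast(Server, {add_event, Event}).
```

Reads then never queue behind a busy process, and evaluations with events disabled send no messages at all. The trade-off is that the writer and readers are no longer serialized, so a reader may briefly see a slightly stale timestamp.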
Describe alternatives you've considered
Our current workaround in production is to disable event synchronization with the server, which is not ideal because we would like the evaluation graphs to be populated.