
Memory grows until OOM with slow telemetry collector and lots of data. #26

@seiferteric

Description


I am opening this issue to get an opinion on how to handle a problem I have observed. We had a telemetry collector (telegraf with a custom gNMI plugin) running with a limited CPU quota (I think it was capped at ~20% CPU time). The collector was using SAMPLE mode to retrieve a large BGP table. Over time we noticed that the memory of the telemetry process kept increasing until we hit OOM and the process was killed.

I traced the issue to the send() function in client_subscribe.go, where we call err = stream.Send(resp). In the case described above, this call blocks when the collector is not processing data quickly enough. The telemetry process then keeps adding data to the PriorityQueue, which causes its memory to grow. To rectify this, I introduced a new "LimitedQueue" in place of the current PriorityQueue in our (Dell) sonic-telemetry. The LimitedQueue checks the size of the queue and rejects new items if the size is greater than a predefined maximum (I set the default to 100MB).
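For reference, here is a minimal sketch of the idea, not the actual Dell patch; the names LimitedQueue, Item, Put and the byte-size accounting are just illustrative:

```go
package queue

import (
	"errors"
	"sync"
)

// ErrQueueFull is returned when adding an item would exceed the size cap.
var ErrQueueFull = errors.New("queue size limit exceeded")

// Item is anything whose in-memory size can be estimated.
type Item interface {
	Size() int // approximate size in bytes
}

// LimitedQueue is a FIFO queue that rejects new items once the total
// buffered size exceeds maxBytes (hypothetical sketch; 100MB default).
type LimitedQueue struct {
	mu       sync.Mutex
	items    []Item
	curBytes int
	maxBytes int
}

func NewLimitedQueue(maxBytes int) *LimitedQueue {
	if maxBytes <= 0 {
		maxBytes = 100 * 1024 * 1024 // 100MB default, as described above
	}
	return &LimitedQueue{maxBytes: maxBytes}
}

// Put adds an item, or returns ErrQueueFull if the cap would be exceeded.
func (q *LimitedQueue) Put(it Item) error {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.curBytes+it.Size() > q.maxBytes {
		return ErrQueueFull
	}
	q.items = append(q.items, it)
	q.curBytes += it.Size()
	return nil
}

// Get removes and returns the oldest item, or nil if the queue is empty.
func (q *LimitedQueue) Get() Item {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.items) == 0 {
		return nil
	}
	it := q.items[0]
	q.items = q.items[1:]
	q.curBytes -= it.Size()
	return it
}
```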

The LimitedQueue approach is working, however it means that the collector will silently start to miss telemetry updates. Broadcom recently recommended that, instead of silently dropping updates, I close the connection with gRPC code RESOURCE_EXHAUSTED.
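A rough sketch of that alternative, building on the hypothetical LimitedQueue above; the helper name and error message are made up, but returning a status error from the streaming handler is what would close the RPC with that code:

```go
package queue

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// enqueueOrAbort is a hypothetical helper for a server-streaming handler such
// as Subscribe(stream gnmipb.GNMI_SubscribeServer) error. When the bounded
// queue rejects a new update, it returns a gRPC status error; returning that
// error from the handler terminates the RPC with RESOURCE_EXHAUSTED instead
// of silently dropping data.
func enqueueOrAbort(q *LimitedQueue, it Item) error {
	if err := q.Put(it); err != nil {
		return status.Errorf(codes.ResourceExhausted,
			"collector is not keeping up, telemetry queue over size limit: %v", err)
	}
	return nil
}
```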

Before opening a PR, I would like to know what the community's preferred way to handle this is.
