Description
Is your feature request related to a problem? Please describe.
Currently, with a time-based window, events are grouped by key or by partition for the period defined when the window is created. The window only tumbles when a new event arrives that falls into a new window, according to the partition/key rules.
Describe the solution you'd like
The time-based window should tumble independently of new messages, as messages with the same key or within the same partition may not arrive for a long time. The window should tumble solely based on the elapsed time it covers.
Solution: The tumbling_window function (and all related functions) on a StreamingDataFrame should accept an additional parameter to control its behavior: whether it tumbles at regular intervals or only when new events arrive.
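To illustrate the requested behavior, here is a minimal sketch in plain Python. It is a toy model, not the Quix Streams implementation or API: the TumblingWindows class and its expire() hook are hypothetical names, and the assumption is that something like a periodic timer would call expire() with the current time.

```python
from collections import defaultdict


class TumblingWindows:
    """Toy per-key tumbling windows (hypothetical, not the Quix Streams API)."""

    def __init__(self, duration_ms):
        self.duration_ms = duration_ms
        # (key, window_start_ms) -> values collected in that window
        self.open_windows = defaultdict(list)

    def add(self, key, value, timestamp_ms):
        # Assign the event to the window that covers its timestamp.
        start = timestamp_ms - (timestamp_ms % self.duration_ms)
        self.open_windows[(key, start)].append(value)

    def expire(self, now_ms):
        """Close every window whose end is at or before now_ms.

        Assumed to be driven by a timer, so closing a window does not
        depend on another event arriving for the same key or partition.
        """
        closed = []
        # sorted() returns a new list, so popping inside the loop is safe.
        for key, start in sorted(self.open_windows, key=lambda k: k[1]):
            end = start + self.duration_ms
            if end <= now_ms:
                values = self.open_windows.pop((key, start))
                closed.append(
                    {"key": key, "start": start, "end": end, "values": values}
                )
        return closed
```

With a 1000 ms window, events at 100 ms and 900 ms for the same key would be emitted by expire(1000) even if no further event for that key ever arrives.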
Describe alternatives you've considered
Currently, I have implemented a memory/database-based solution for resilience to restarts, where one process collects new messages and another process validates the accumulated events within the window's time frame.
Another (less ideal / dirty) workaround is to post dummy messages to the topic at regular intervals so that the window tumbles, and then to discard these messages based on their content.
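The discard step of the dummy-message workaround could look roughly like the sketch below. The marker field name and the helper functions are hypothetical, chosen only for illustration; messages are assumed to be plain dicts.

```python
# Hypothetical marker field that identifies a dummy (heartbeat) message.
HEARTBEAT_FIELD = "__heartbeat__"


def make_heartbeat(timestamp_ms):
    """Build a dummy message whose only purpose is to advance the window."""
    return {HEARTBEAT_FIELD: True, "timestamp_ms": timestamp_ms}


def is_heartbeat(message):
    """Return True if the message carries the heartbeat marker."""
    return bool(message.get(HEARTBEAT_FIELD))


def drop_heartbeats(messages):
    """Remove dummy messages from a batch collected by the window."""
    return [m for m in messages if not is_heartbeat(m)]
```

Since the window tumbles according to partition/key rules, the dummy messages presumably have to reach every key or partition that has an open window, which is part of what makes this workaround unattractive.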
Additional context
Use case for this feature: grouping messages to start batch processing for all accumulated messages at once. Once the time frame has elapsed, the batch must be triggered, as there may be no further events for a while. Waiting for the next message could introduce unnecessary idle time, impacting performance.