issue: 4409403 Fix heap corruption since c73d96a #337

tomerdbz · 2025-04-22T12:20:13Z

Description

This PR fixes a critical heap corruption issue in TCP socket timer management
that was introduced in commit c73d96a.

What

Fix race condition between timer thread and socket deletion in TCP socket timers.

Why ?

A race condition was introduced in commit c73d96a that allowed the timer thread
to access socket objects after they had been deleted by the event handler thread,
causing heap corruption and crashes.

How ?

The fix improves synchronization between threads:

Removes sockets from timer collections while holding the socket lock
Creates a new deletion path that doesn't attempt to access timer collections
after socket cleanup
Additionally fixes a lock leak in the early return path of clean_socket_obj()

Change type

What kind of change does this PR introduce?

Check list

Code follows the style de facto guidelines of this project
Comments have been inserted in hard to understand places
Documentation has been updated (if necessary)
Test has been added (if possible)

Performance impact

We now hold the socket lock a bit longer during cleanup to remove it from timer collections
This is a one-time cost during socket closure, not during normal operation
Only affects the cleanup path, not data transmission or reception.

It can affect CPS tests, but preventing heap corruption far outweighs that minor performance impact imho.

How this bug was found?

sockperf UDP ping-pong mode recreated this issue - it was found during new config STP.
git bisect found the commit
Cursor helped with explaining the commit issues
bug found
sockperf didn't crash anymore :)

attaching a pic for a sequence diagram explaining the faulty flow that was fixed:

This commit fixes a critical race condition in timer management for TCP sockets that was introduced in commit c73d96a. The heap corruption was caused by a race condition between the timer thread and socket destruction. Sockets could be deleted by the event handler thread while still being referenced by the timer thread in the timer collections, resulting in heap corruption when the timer thread attempted to access the deleted memory. In the original implementation, sockets were removed from timer collections and deleted asynchronously without proper synchronization with the timer processing thread. Fix: - Remove sockets from timer collections while still holding the socket lock, guaranteeing the timer thread cannot access sockets marked for deletion - Create a simplified deletion path that doesn't attempt to access timer collections again after socket cleanup Additionally, as an unrelated improvement, this patch fixes a lock leak in the early return path of sockinfo_tcp::clean_socket_obj() where a lock was acquired but not released when a socket was already marked as cleaned. The heap corruption stemmed from a fundamental architectural change that separated socket objects from their timer management without providing proper synchronization for the distributed socket lifecycle. Signed-off-by: Tomer Cabouly <[email protected]>

pasis · 2025-04-23T10:56:34Z

As we discussed offline,
The "timer thread" shouldn't exist and running timers outside the internal thread (event handler thread) is problematic indeed. In the past, we removed such a flow when the tcp timer could run as part of sockinfo_tcp::tcp_con_unlock() causing issues.

First, we need to identify the flow to run tcp timer outside of the internal thread context and this flow will like need to be avoided.

Note, this comment assumes that delegated timers feature is off.

tomerdbz changed the base branch from master to vNext April 22, 2025 12:20

tomerdbz requested review from pasis and galnoam April 23, 2025 10:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

issue: 4409403 Fix heap corruption since c73d96a #337

issue: 4409403 Fix heap corruption since c73d96a #337

Uh oh!

tomerdbz commented Apr 22, 2025 •

edited

Loading

Uh oh!

pasis commented Apr 23, 2025

Uh oh!

Uh oh!

issue: 4409403 Fix heap corruption since c73d96a #337

Are you sure you want to change the base?

issue: 4409403 Fix heap corruption since c73d96a #337

Uh oh!

Conversation

tomerdbz commented Apr 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

What

Why ?

How ?

Change type

Check list

Performance impact

How this bug was found?

Uh oh!

pasis commented Apr 23, 2025

Uh oh!

Uh oh!

tomerdbz commented Apr 22, 2025 •

edited

Loading